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Abstract 


Neural  networks  represent  a  promising  alternative  to  traditional  AI  approaches. 
The  development  of  analog  optical  implementations  of  neural  networks  such  as  the 
multilayer  perceptron  with  learning  by  backward  error  propagation  (BEP)  requires  an 
understanding  of  the  noise  sensitivity  of  such  architectures.  The  objective  of  this 
program  is  to  study  the  effects  of  component  and  system  noise  on  the  performance  of 
such  optical  implementations.  The  method  used  is  computer  simulation. 

In  this  first  phase  of  the  program,  the  one-hidden  layer  perceptron  with  back 
propagation  was  simulated  using  a  simplified,  device-independent  noise  model.  The 
results  point  to  a  distinct  noise  threshold  above  which  the  learning  mechanism  is 
corrupted.  The  efficiency  of  learning  based  on  variations  within  back  propagation  and 
on  the  initializing  method  was  also  studied. 

In  the  next  phase,  device-dependent  noise  models  were  used.  To  this  end  a 
hybrid  optical/electronic  parallel  architecture  capable  of  both  the  forward  pass  and 
backward  error  propagation  steps  of  training  data  presentation  was  conceived  and 
modeled.  The  simulations  showed  that  the  most  significant  effects  were  due  to  the 
SLMs’  nonlinear  response.  The  “noise”  processes  studied  in  the  simulations  included  the 
SLM  nonlinear  response,  SLM  drive  circuit  noise,  nonuniform  SLM  response,  finite  SLM 
contrast  ratio,  optical  crosstalk,  photodiode  shot  and  thermal  noise,  and  CCD  shot  and 
thermal  noise.  It  was  found  that  the  SLMs’  nonlinear  response  significantly  increases  the 
number  of  cycles  needed  for  convergence  in  learning.  Several  solutions  to  this  problem 
were  characterized  in  the  simulations.  Another  conclusion  of  the  simulation  results  was 
that  increasing  the  hidden  layer  size  increases  noise  immunity  significantly. 
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I.  BACKGROUND 


A.  Program  Motivation 

Neural  networks  are  computers  that  are  based  on  organizational  and  functional 
principles  of  information  processing  systems  found  in  nature,  like  the  brain  and  retina. 
Hence,  neural  nets  consist  of  many  analog  processing  elements  that  are  densely 
interconnected  and  they  are  applied  to  problems  such  as  sensor  pre-processing,  pattern 
recognition/classification,  and  motor  control.  In  neural  networks,  long-term  information 
is  stored  as  interconnection  weights  which  signify  the  efficacy  of  interaction  between 
neurons.  Transient  information  is  represented  by  the  neural  outputs,  which,  for  a  given 
neuron,  is  a  nonlinear  function  of  its  state  of  activation.  Each  neuron’s  state  of 
activation  is  determined  by  a  number  of  factors;  its  external  inputs,  which  are  the 
weighted  outputs  from  other  neurons;  its  previous  states  of  activation;  and  other  specific 
and  nonspecific  global  signals.  The  values  of  the  interconnection  weights  change  more 
slowly  than  the  neuron’s  state  of  activation.  The  gradual  evolution  of  the 
interconnection  weights  has  been  widely  postulated  as  the  primary  learning  mechanism 
that  makes  animals  adapt  to  a  constantly  changing  environment.  The  proposed  neural 
net  models  for  problem  solving  attempt  to  emulate  these  intriguing  characteristics  of 
biological  information  processing  systems.  References  1,  2,  and  3  contain  representative 
examples  of  recent  research  on  neural  net  development. 

The  primary  advantage  of  adaptive  neural  net  problem  solving  approaches  over 
conventional  methods  is  that  the  networks  learn  how  to  solve  problems  semi- 
autonomously;  from  labeled  or  unlabeled  training  data,  the  network  learning  rules 
calculate  weights  which  will  produce  the  appropriate  outputs.  Thus,  there  is  no  need  for 
standard  artificial  intelligence  techniques  like  investigation  of  the  nature  of  the  problem 
and  extensive  programming  of  solution  strategies— all  that  is  required  is  access  to  raw 
data.  This  approach  is  especially  useful  in  several  scenarios:  when  designing  systems 
that  can  be  applied  to  a  variety  of  problems  with  little  modification;  when  the  size  and 
complexity  of  the  problem  makes  rule  discovery  by  hand  prohibitively  expensive;  when 
rapid  solution  to  a  problem  is  desired;  when  the  nature  of  the  problem  is  dynamic;  or 
when  it  is  difficult  to  ascertain  the  structure  of  the  problem  due  to  noisy  and/or 
distorted  data. 


Nonadaptive  neural  networks  are  designed  to  solve  specific  problems.  With  this 
approach,  the  appropriate  set  of  weights  and  initial  conditions  are  determined  by  the 
user.  Later,  when  inputs  are  presented,  the  state  of  the  network  converges  to  the  proper 
answer  through  the  network  dynamics.  In  these  networks,  either  extensive  calculations 
must  be  performed  to  find  the  appropriate  set  of  weights  or  a  detailed  prior  knowledge 
of  the  desired  processing  is  needed.  The  network  must,  therefore,  be  used  many  times 
to  justify  the  cost  of  its  construction.'*  Hence,  this  type  of  network  is  most  suitable  for 
sensor  pre-processing  type  applications  where  the  same  operations  must  be  performed  on 
many  data  sets. 

Optical  systems  have  been  proposed  as  candidates  for  neural  network 
implementation  for  a  number  of  reasons.  Foremost  are  the  analog  and  parallel  nature  of 
neural  computation— optical  systems  have  been  employed  for  a  number  of  years  to  solve 
significant  problems,  like  synthetic  aperture  radar  imaging  and  RF  spectrum  analysis, 
with  parallel,  analog  hardware.  In  addition,  neural  nets  often  require  complex  and  dense 
interconnections  between  the  neurons— and  interconnection  and  communication  are  the 
primary  advantages  optics  has  over  electronics.  Finally,  neural  nets  require  analog 
storage  of  interconnection  weights  that  can  be  accessed  and  updated  in  parallel  and 
several  2-D  and  3-D  optical  devices  exist  that  can  provide  this  functionality. 

For  analog  optical  numeric  processors,  the  accuracy  of  the  overall  computation  is 
strongly  dependent  on  the  accuracy  of  the  analog  optical  devices.  When  such  numeric 
processors  were  first  proposed,  the  computations  considered  for  analog  optical  systems 
were  primarily  linear  (matrix  operations)  and  the  accuracy  of  devices  was  quite  low; 
therefore,  so  was  the  accuracy  of  the  overall  processor.  Hence,  the  available  applications 
were  limited  to  those  requiring  low  precision.  This  motivated  the  investigation  of  analog 
optical  systems  for  neural  nets,  which  were  postulated  to  require  low  accuracy 
computation.  The  first  neural  network  that  optics  researchers  chose  to  implement,  the 
Hopfield  model,'^'®  was  indeed  relatively  insensitive  to  inaccuracies  in  the  response  of  the 
analog  components.  As  it  turns  out,  the  very  features  of  the  Hopfield  model  that  make 
it  relatively  insensitive  to  inaccurate  components,  namely  the  particular  type  of 
distributed  and  redundant  storage/computation,  also  limits  its  storage  capacity,  and 
hence,  its  utility. 

Since  the  publication  of  the  Hopfield  model,  there  has  been  a  number  of 
potentially  useful  neural  network  models  reported  in  the  literature  along  with  proposals 
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for  their  optica)  implementation.®  However,  it  has  been  commonly  assumed  that  since 
the  Hopfield  model  was  relatively  impervious  to  hardware  imprecision,  so  are  these  other 
models.  This  is  not  necessarily  true— each  model  must  be  examined  closely  to  determine 
its  own  sensitivity  to  analog  hardware  imprecision.  Since  many  modern  neural  net 
models  improve  upon  the  storage  capacity  of  the  Hopfield  model  by  means  of  less 
redundant  storage  techniques,  they  may  lead  to  systems  which  are  more  sensitive  to 
noise.  Dependence  of  the  noise  sensitivity  on  the  size  of  a  given  neural  net  model  is 
another  issue  of  great  concern  since  the  neural  net  approach  to  problem  solving  may 
become  competitive  with  conventional  approaches  only  when  the  network  is  very  large. 
It  is  clear  from  the  above  discussion  that  the  detailed  study  of  a  neural  network  when 
implemented  in  any  analog  technology  (optical  or  electronic)  is  of  critical  importance. 

B.  Program  Goal 

The  main  goal  of  the  research  program  was  to  study  the  effects  of  component 
and  system  noise  on  the  performance  of  optical  implementations  of  selected  neural  net 
models.  The  noise  could  be  due  to  variations  in  the  response  of  different  components, 
finite  accuracy  in  controlling  signal  amplitudes,  or  signal-  and  time-dependent  noise  due 
to  quantum  and  thermal  fluctuations.  Since  the  size  and  speed  of  neural  net  processors 
will  be  factors  that  influence  their  utility,  the  results  of  this  study  will  be  critical  in 
determining  the  ultimate  applicability  of  optical  neural  nets.  The  identification  of  those 
parts  of  the  neural  net  model  that  are  particularly  sensitive  to  system  noise  will  stimulate 
investigations  into  new  techniques  of  data  representations  and  formatting  to  increase  the 
robustness  of  the  models.  Similarly,  the  identification  of  the  limiting  devices  or 
materials  in  the  optical  implementation  of  the  selected  neural  net  models  will  lead  to 
exploration  of  different  technological  and  architectural  alternatives  for  improved 
performance. 
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II.  TAXONOMY  OF  NEURAL  NETWORKS  AND  TRAINING  METHODS 


A.  Categories  of  Neural  Network  Models  and  Model  Selection  Philosophy 

Neural  network  models  can  be  designed  to  perform  several  operations,  e.g., 
associative  memory,  optimization,  filtering,  pattern  classification.  The  performance 
criteria  for  a  given  model  are  determined  by  its  intended  application.  We  choose  neural 
pattern  classifiers  for  further  investigation  under  this  program.  For  these  models  one 
performance  criterion  is  simply  the  number  of  training  or  test  patterns  the  model 
misclassifies.  Another  is  corruption  of  the  learning  curve  during  training.  The  effect  of 
system  noise  on  the  selected  neural  net  classifiers  is  investigated  using  the  above- 
mentioned  criteria. 

Neural  net  models  can  be  categorized  according  to  several  different  parameters. 
The  first  one  of  these  parameters  is  the  topology  of  the  neural  net  (single  or  multiple 
layers  of  processing  elements).  The  second  one  is  the  processing  element  response 
(linear,  hardclipping  nonlinear,  or  continuous  nonlinear).  The  third  parameter  deals  with 
the  selection  of  learning  algorithm.  Error-driven  algorithms  can  be  used  with  labeled 
training  data  consisting  of  input  patterns  along  with  their  correct  classification. 
Unlabeled  training  data  uses  self-organizing  algorithms  capable  of  autonomously 
clustering  the  input  patterns  into  distinct  classes  and  adjusting  the  internal  parameters  of 
the  neural  net  to  generate  the  desired  classification. 

The  limitations  of  a  single  layer  neural  net  model  in  classification  have  been  well 
documented.  Hence  we  have  selected  a  multilayer  neural  net  classifier.  It  can  be 
readily  seen  that  a  linear  processing  element  response  reduces  a  multilayer  neural  net  to 
an  equivalent  single  layer  neural  net  thereby  suffering  from  the  same  limitations. 
Therefore  we  have  chosen  a  nonlinear  (hardclipping  or  continuous)  response  for  the 
processing  element.  For  the  purpose  of  this  study  we  choose  the  error-driven  learning 
algorithms  that  are  used  with  labeled  training  data.  The  self-organizing  systems  were 
not  selected  because  optical  implementations  for  them  are  relatively  less  developed. 

Figure  1  depicts  a  taxonomy  of  multilayer  feedforward  neural  networks.  As 
shown,  each  algorithm  is  spedific  to  layout  (architecture  type)  and  training  philosophy. 
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Figure  1.  A  taxonomy  of  multilayer  feedforward  neural  networks. 


B.  The  Backward  Error  Propagation  Mouci 

The  most  prominent  multilayer,  nonlinear,  error-driven  neural  net  is 
multilayer  perceptron  trained  by  the  method  of  backward  error  propagation  (BEP).^^ 
This  is  a  least  mean-square  error  approach  that  uses  a  gradient  descent  algorithm.  The 
internal  parameters  of  the  neural  net  (the  connection  weights  and  processing  element 
thresholds)  are  modified  such  that  the  new  weights  lead  to  a  decrease  in  the  total  mean- 
square  error.  This  model  is  a  direct  descendant  of  the  Widrow-Hoff  Adaline  model  that 
was  developed  for  a  linear,  single  layer  neural  net  model.^^  It  has  also  been  successfully 
applied  to  interesting  problems  such  as  distinguishing  underwater  man-made  objects 
from  natural  ones  based  on  their  sonar  returns^*  and  solar  flare  prediction.^^ 

The  BEP  model  uses  a  continuously  differentiable  nonlinear  response  for  the 
processing  element.  Therefore  the  signals  propagating  between  different  layers  are  fully 
analog.  This  could  lead  to  error  accumulation.  This  feature  makes  the  issue  of  system 
noise  particularly  relevant  to  the  BEP  model  and  hence  appropriate  for  this  study.  Fully 
optical  implementations  of  BEP  model  have  been  proposed.^®  Hybrid  optical-digital 
electronic  systems  have  also  been  proposed  in  which  part  of  the  training  procedure  is 
performed  in  an  auxiliary  digital  electronic  processor.  The  effect  of  weight 

quantization  on  the  learning  performance  of  a  BEP  model  has  been  previously 
reported.  These  studies  were  geared  towards  specific  optical  or  electronic 

implementations  in  which  the  connections  were  stored  electronically  in  a  digital 
representation.  The  current  study  extends  that  work  to  quantify  the  effects  of  analog 
system  noise  in  the  weights  as  well  as  in  the  processing  element  computation. 

C.  Exemplar-based  Model 

The  BEP  classifier  is  characterized  by  a  long  training  interval  in  which  the 
training  pairs  have  to  be  presented  as  many  as  several  thousand  times  before  the  ccrrect 
weight  vectors  are  identified  High  computational  accuracy  may  therfore  be  needed  to 
achieve  convergence.  This  learning  procedure  also  can  have  the  tendency  to  get  stuck  in 
a  local  minimum  for  the  total  error  function.  On  the  other  hand,  the  number  of 
processing  elements  (and  hence  the  total  weight  storage  requirement)  is  independent  of 
the  number  of  exemplars,  i.e.,  training  input  patterns.  The  BEP  learning  procedure, 
thus,  makes  efficient  use  of  resources,  in  that  the  number  of  interior  (or  hidden) 
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processing  elements  needed  is  a  function  of  the  decision  space  complexity,  rather  than 
simply  the  quantity  of  training  data. 

Exemplar-based  classifiers  typically  require  many  hidden  processing  elements, 
but  can  be  trained  in  a  relatively  short  time  and  with  low  accuracy.  These  tradeoffs 
suggest  that  the  exemplar-based  classifiers  may  be  more  suited  to  optical  technology  that 
provides  large  storage  capacity  but  poor  computational  accuracy.  For  this  reason,  we 
have  been  examining  exemplar- based  classifiers  for  further  study.  Optical  processors 
can  be  restricted  to  the  first  or  second  layer  of  the  neural  net  performing  the 
computationally  intensive  operations.  High  precision  analog  or  digital  electronics  can  be 
used  in  the  final  layers  that  perform  the  classification. 


7 


III.  SIMULATIONS  OF  BACK  PROPAGATION 


A.  Description  of  Tool 

Throughout  the  program,  we  have  striven  to  understand  the  dynamics  of  a 
multilayer  perceptron  which  is  learning  by  back  propagation  of  errors,  and  to  observe 
the  effects  of  the  various  noise  sources  on  these  dynamics. 

To  these  ends  we  have  been  using  PC-Matlab  as  a  programming  and  analysis 
tool.  PC-Matlab  works  with  1-D  and  2-D  variables.  When  it  is  running  an  “.m”  file 
(a  program),  it  can  display  in  text  or  graphical  form  the  ever-changing  state  of  the 
network.  Information  such  as  a  learning  curve  can  be  stored  in  a  “.mat"  file. 

The  core  of  our  simulation  program  is  found  in  two  modules,  “uinit.m”  and 
“uctrain.m.”  These  modules  are  written  specifically  for  the  case  of  one  hidden  layer, 
although  it  is  simple  to  modify  them  to  add  layers.  Among  other  tasks,  “uinit.m”  loads 
the  training  data,  initializes  the  weights  and  biases,  and  sets  up  the  beginning  of  a 
training  session.  Then  “uctrain.m”  iteratively  updates  the  weights,  tracks  the  mean- 
square  error,  and  provides  for  early  termination  of  training  upon  fulfillment  of  a 
convergence  criterion.  Note  that  a  training  seed  (trseed)  is  used  with  Matlab’s  random 
number  generator,  so  that  an  identical  set  of  “random”  noise  spikes  may  be  used  in 
different  runs,  if  desired. 

These  two  program  modules  run  as  though  the  hidden  layer  size,  initializing 
criteria,  the  learning  rate,  and  the  various  noise  values  are  already  defined  in 
PC-Matlab’s  work  space.  A  “.m”  file  called  “uframe.m”  prompts  for  these  data  so  that 
“uinit.m”  and  “uctrain.m”  may  be  run  without  causing  undefined-variable  errors. 

These  three  “.m”  files,  as  well  as  some  of  the  nested  “.m”  files,  are  listed  in  the 
Appendix. 

B.  Backward  Error  Propagation  Details 

To  provide  a  deeper  understanding  of  the  simulation  tools,  we  shall  now  describe 
exactly  how  back  propagation  proceeds,  giving  due  attention  to  the  updating  procedure. 
Back  propagation  is  defined  for  a  multilayer  perceptron,  a  neural  network  containing 
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nodes  configured  in  an  input  layer,  a  number  of  hidden  layers,  and  an  output  layer.  A 
one-hidden-layer  architecture  is  shown  in  Figure  2.  (Strictly  speaking,  the  input  layer 
units  are  not  full-fledged  nodes.  They  simply  broadcast  the  input  signals  to  the  nodes  in 
the  hidden  layer.  Therefore,  many  call  this  a  /wo-layer  network.) 


Hidden  layer  Output  layer 


Figure  2.  A  perceptron  with  one  hidden  layer.  The  network  has  I  inputs 
(/  =  1  to  /),  J  hidden  processing  units,  and  K  output  processing  units. 
The  weight  lFji(/,y)  refers  to  the  connections  strength  from  input  unit  i  to 
hidden  unit  j,  and  similarly  at  the  next  layer.  Note  that  the  inputs  to  the 
biases  are  held  at  1.  After  summing,  a  given  unit  performs  a  sigmoidal 
nonlinearity. 


The  multilayer  perceptron  learns  iteratively,  each  iteration  having  two  main  steps. 
In  the  first  of  these,  it  processes  the  inputs  via  what  is  commonly  referred  to  as  a 
forward  pass.  The  input  elements  are  multiplied  by  the  weights;  then  they  are  sununed; 
this  sum  is  often  called  the  internal  activation.  It  is  then  transformed  (“thresholding”) 
by  a  nonlinearity  in  the  hidden  nodes.  The  outputs  of  the  hidden  nodes  are  processed  in 
similar  fashion.  The  inputs  and  outputs  (o  terms)  are  unipolar,  due  to  the  thresholding; 
in  some  cases,  the  inputs  may  be  restricted  to  binary  values.  The  weights  and  biases  are 
bipolar  and  continuous. 


In  equation  form,  the  forward  pass  is  written: 


1. 


Oj  =  1/(1  +  exp(-A^e/j)) 


>rward  Pass 


Neti^  Eoiirji+  e-i 
i 

2.  Net^  =  Y.  Ok  =  1/(1  +  exp(-iVelk)) 

j 

The  two  equations  on  the  right  express  a  particular  type  of  thresholding,  namely,  a 
sigmoidal  nonlinearity. 

1 .  Gradient  Descent 

Learning  in  this  context  refers  to  the  modification  of  the  weights  and  biases  in  a 
network.  For  the  network  to  perform  properly,  all  layers  are  required  to  learn. 
Gradient  descent  is  a  training  algorithm  which  iteratively  updates  the  weights  and  biases 
in  a  given  layer  based  on  its  inputs,  outputs,  and  target  outputs.  That  is,  the  weight 
update  is  supposed  to  move  the  weights  in  a  direction  in  weight  space  in  which  the 
mean-square  error  over  all  training  pairs  decreases  the  most.  This  is  expressed 
mathematically  as  a  reduction  in  the  mean-square  error  computed  over  all  components  of 
all  training  pairs: 

AIF  «  -dE/dWui  where  E  =  Y  T.  (trk-Ork)^ 

r  k 

trk  and  Ork  being  the  target  vector’s  and  output  vector’s  kth  element,  respectively,  for  the 
rth  training  pair.  This  is  often  called  the  delta  rule.  The  derivative  can  be  computed  by 
application  of  the  chain  rule: 

af/aiFkj  =  {dE/dNetk)  X  {dNeti^/dWii). 

The  quantity  dNetk/dW^i  is  just  Oj.  The  other  link  is  called  -5k,  and  is  itself 
expressible  as  a  chain: 


dEldNetk  =  -5k  =  (dE/dOi)  x  (dok/dNetk) 

=-(/k-Ok)  X  (Ok  (1-Ok)) 
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Note  that  Ok  (1-Ok)  =  9{1/(1  +  txp(-Neti)))/dNetk;  it’s  the  -derivative  of  the  sigmoid 
nonlinearity.  That  is  why  that  particular  threshold  function  is  chosen:  because  it  is 
continuous  and  easily  differentiable. 

During  presentation  of  an  input  (forward  pass),  the  values  tk  tmd  Ok  are  readily 
available  at  the  output  layer;  tk  is  one  element  of  the  training  data.  In  the  hidden  layer, 
these  terms  become  tj  and  oj,  of  which  the  the  latter  is  the  actual  output  of  the  jth 
hidden  unit  upon  presentation  of  training  input  r.  The  former,  tj,  is  the  jth  element  of 
what  is  called  the  internal  representation;  this  is  not  known  a  priori. 

Back  propagation^^  represents  the  first  successful  method  for  calculating  an 
equivalent  of  the  target-minus-output  error  for  hidden  layers.  This  error  term  is 
calculated  by  propagating  the  output  layer’s  error  term  backward  through  the  output 
layer’s  weights.  That  is,  it  can  be  shown  that 


dE/doj  -  -5]  fik  Wkj 

k 

The  steps  within  back  propagation  are  thus  summarized: 

Backward  Error  Propagation 

3.  ik  =  Ok  (1-Ok)  (/k-Ok) 

4.  Sj  =  Oj  (1-Oj)  Y. 

k 

5.  AlTkj  =  t;  Oj  5k  A5k  =  »7  5k 

6.  AlFji  =  T?  Oi  5j  £Jj  =  T}  Sj 

In  the  PC-Matlab  modules,  oj  is  a  row  vector  of  length  /;  Netj,  oj,  6j,  and  5j  are 
row  vectors  of  length  J,  and  similarly  for  k.  The  variable  ri  is  called  the  learning  rate. 

The  derivation,  as  presented  in  Reference  11,  leaves  some  questions  open.  For 
example,  are  the  weights  and  biases  to  be  updated  after  presentation  of  each  training 
pair  (updating  by  pattern)?  Or  are  the  weight  changes  to  be  accumulated  in  a  separate 
register  over  the  entire  epoch  of  training  pairs  and  then  added  to  the  weights  (updating 
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by  epoch)?  True  gradient  descent  is  based  on  updating  by  epoch,  but  some  researchers 
advocate  pairwise  updating  as  a  means  to  improved  performance. 

Recall  what  gradient  descent  means:  the  weights  change  in  the  direction  of 
greatest  decrease  in  the  total  mean-square  error.  True  gradient  descent  requires 
infinitesimal  weight  changes.  For  larger  weight  changes,  a  decrease  in  mean-square 
error  is  not  guaranteed.  An  error  landscape  may  be  convoluted,  and  an  update  may 
move  the  weights  too  far,  to  a  point  where  the  error  is  higher.  In  other  words,  a  large 
learning  rate  can  create  oscillations  in  the  plot  of  mean-square  error  vs.  epoch.  A 
convoluted  landscape  possesses  local  minima,  which  can  halt  convergence,  with  the 
network  stuck  in  an  unsolved  state. 

These  problems  are  often  remedied  by  incorporating  a  momentum  term  o  which 
introduces  a  component  of  the  previous  weight  change  into  the  current  one.  Step  5 
above  becomes 


Alfkj(«)  =  +  Q!AlFkj(n-l) 

Atfk(n)  =  »?5k  +  otASk(n-l) 

and  similarly  for  the  hidden  layer,  where  n  tabulates  the  actual  update,  whether  it  was 
pairwise  or  not.  According  to  Gilbert,^*  momentum  is  used  in  updating  by  pattern  to 
incorporate  information  about  the  previous  pair,  making  training  based  on  more 
complete  information. 

2.  Initial  Conditions. 

The  convergence  behavior  produced  by  the  back  propagation  algorithm  depends 
on  the  initial  values  of  the  weights  and  biases.  If  all  weight  values  start  out  equal,  the 
algorithm  keeps  them  so,  and  the  network  will  not  learn.  Rumelhart's  solution  is  to 
initialize  the  weights  and  biases  with  small  random  values,  to  provide  symmetry 
breaking.  He  does  not  state  what  “small”  means,  however. 

Is  there  another  way  to  initialize  the  weights?  Our  desire  is  a  configuration 
which  helps  the  network  to  solution,  but  is  not  specific  to  any  one  problem  (does  not 
“cheat”).  Sheldon  Gilbert^®  proposes  one  method  based  on  Lippmann’s^®  discussions  on 
internal  representations. 


12 


A  network  which  has  learned  XOR  is  shown  in  Figure  3.  Figure  4  shows  the 
output  of  the  upper,  7=1,  hidden  unit  (vertical  axis)  as  a  function  of  the  two  inputs 
(horizontal  axes).  The  magnitude  of  the  input  weight  vector  determines  the  steepness  of 
the  output— how  close  the  “hill”  of  Figure  4  is  to  being  a  step  function.  The  associated 
decision  line,  a  one- dimensional  hyperplane,  is  just  the  intersection  of  this  output  with 
the  plane  Oj(l)  =  0.5.  Each  hidden  unit  in  Figure  3  is  labeled  with  a  diagram  of  its 
decision  region,  showing  this  decision  line.  The  arrow  indicates  in  which  half-plane  the 
output  is  greater  than  0.5.  Note  that  the  dividing  decision  lines  actually  pass  through  the 
decision  region. 

Gilbert’s  method  starts  with  initially  random  weights  and  biases,  and  scales  the 
weight  vectors  to  a  uniform  magnitude.  Then  it  adjusts  the  biases  so  that  the  dividing 
hyperplane  passes  through  (0.5,  0.5,...,0.5)— the  middle  of  the  decision  region.  This  is 
illustrated  in  Figure  5.  Next,  the  weights  to  the  output  layer  are  all  set  to  1/7,  so  that, 
regardless  of  the  size  of  7,  the  output  units  start  out  with  reasonably-sized  Nett's.  For 
example,  even  if  the  initial  oj  =  [1,  1,...,1],  each  Nety^  term  (with  =  0)  will  be  only  7  x 
(1  X  1/7)  =  1,  well  within  the  linear  region  of  the  sigmoid  defined  in  Equation  2.  (In 
point  of  fact,  we  have  initialized  the  terms  to  small  random  values  within  [-0.5,  0.5] 
in  our  simulations.)  In  the  simulations  that  Gilbert  performed  on  2-D  problems,  the 
initializing  algorithm  went  on  to  force  the  dividing  lines  to  span  360*.  We  chose  to  omit 
this  latter  restriction  for  two  reasons:  first,  it  seems  too  “forced”  for  typical  problems, 
and  second,  it  encourages  deciding  which  dividing  lines  should  run  which  way  in  input 
space,  the  answer  to  which  is  best  found  by  already  knowing  the  final  state  of  the  solved 
net. 
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Figure  3.  A  J  =  2  solved  XOR  configuration. 


3.  Creating  Target  Vectors  for  Classes 

For  classification  problems,  one  can  express  the  classes  one  of  two  ways.  The 
first  way  uses  a  dedicated  output  node  for  each  class.  A  typical  training  pair  has  an 
input  vector  and  an  output  vector  with  all  elements  low  except  for  that  corresponding  to 
the  correct  class.  The  second  way  uses  a  binary  representation  for  each  class.  Here  a 
four-class  problem  could  be  implemented  with  two  instead  of  four  output  nodes. 


In  addition,  how  the  output  training  vector  expresses  “high”  and  “low”  is 
important.  Since  the  activation  function  asymptotically  approaches  0  and  1,  using  these 
in  the  training  vector  may  cause  excessively  large  error  signals  to  be  propagated  back. 
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Figure  4.  The  output  of  the  j  =  I  hidden  unit  as  a  function  of  the  inputs  to  the 
XOR  network. 
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Figure  5.  Initial  dividing  hyperplanes  (lines,  in  this  case)  before  and  after 
application  of  Gilbert’s  method. 

All  of  our  simulations  were  two-class  problems,  where  we  used  K  =  1  (binary 
representation).  Also,  unless  otherwise  noted,  we  trained  our  output  on  (0.1,  0.9)  rather 
than  {0,  1)  to  avoid  the  excessive  error  signals. 

C.  Summary  of  Results  from  Phase  I 

Earlier  Matlab  program  modules  enabled  us  to  run  simulations  of  the  back 
propagation  algorithm  using  a  simplified  additive  noise  model.  In  that  model,  small 
Gaussian  noise  terms  were  added  to  all  of  the  weights  immediately  after  they  were 
updated.  We  also  explored  a  variety  of  updating  by  pattern  in  which  hidden  weight 
changes  were  based  on  propagating  the  errors  through  output  weights  which  were 
already  updated.  Although  it  is  quite  afield  from  true  gradient  descent,  this  layered 
updating  by  pattern  converged  the  fastest.  However,  updating  by  epoch  was  found  to  be 
less  noise  sensitive. 

One  must  also  choose  a  suitable  learning  rate,  large  enough  to  speed  performance 
without  oscillations  through  the  error  landscape.  Oscillations  and  entrapment  in  local 
minima  can  be  avoided  using  the  momentum  term  a,  but  as  we  shall  see,  this  does  not 
lend  itself  well  to  optics.  It  is  also  wise  to  choose  targets  away  from  the  asymptotic  tails 
of  the  threshold  function. 
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Performance  is  dramatically  improved  by  initializing  the  weights  so  that  the 
dividing  hyperplanes  created  by  the  hidden  units  cross  the  center  of  the  input  vector 
space,  and  normalizing  the  weight  vector  magnitudes. 
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IV.  OPTICAL  ARCHITECTURE  FOR  BEP 


A.  Overview 

The  simulation  of  a  true  optical  neural  net  must  incorporate  more  than  simply 
additive  noise  applied  to  the  weights  and  biases  after  update.  It  is  necessary  to  take  into 
account  noise  introduced  by  modulators,  the  optical  imaging  system,  and  detectors.  This 
requires  designing  an  architecture  capable  of  handling  the  whole  process,  including  the 
forward  pass,  error  back  propagation,  accumulation  of  the  Alf”s  over  the  epoch,  and 
addition  of  them  at  the  end.  The  architecture  need  not  be  optimum,  only  plausible. 

Our  proposed  architecture  uses  optics  to  perform  operations  of  0{N^),  where  N  is 
the  typical  layer  size,  e.g.,  /,  J  or  Af;  other  operations  may  take  place  in  the  electrical 
domain.  The  inputs  and  weight/bias  matrices  are  assumed  to  be  implemented  as  spatial 
light  modulators  (SLMs). 

Figure  6  shows  our  multilayer  perceptron  architecture  for  a  net  in  which  7=3, 
7  =  4  and  K  -  2.  SLMs  are  represented  as  unshaded  planar  regions;  detectors  are  shown 
shaded.  Thick  arrows  represent  the  propagation  of  information-carrying  light;  thin 
arrows  refer  to  signals  in  the  electrical  domain.  For  simplicity,  the  figure  does  not  show 
the  necessary  cylindrical  and  spherical  imaging  lenses,  nor  the  switchable  birefringent 
wave  plates  needed  to  direct  light  the  proper  way  using  the  polarizing  beamsplitters. 
Wherever  possible,  we  have  striven  to  use  the  same  weights  in  backward  as  in  forward 
passes.  This  saves  hardware  and  reduces  noise  accumulation. 

As  depicted  in  Figure  6,  the  architecture  is  performing  a  forward  pass.  Recall 
the  governing  equations  presented  in  Section  III.  B.  1.  The  forward  passes  are  vector- 
matrix  multiplications  shown  in  Figure  6  by  black  left-to-right  arrows.  While  the  inputs 
Oi  and  Oj  are  unipolar  and  restricted  to  the  interval  [0,  1],  the  weights  and  biases  are 
bipolar  and  can  have  larger  magnitudes.  We  have  tracked  the  weights  and  biases 
through  simulations  on  2-D  and  S-D  problems  and  seen  values  as  large  as  25.  Since 
the  SLMs  do  not  amplify,  we  let  them  express  values  Wfi/h,  where  h  corresponds  to  half 
the  actual  range  of  the  weights.  (For  example,  if  weights  range  from  -25  to  25,  h  is  set 
to  25.)  Also,  we  have  divided  the  matrix  elements  into  positive  and  negative 
subelements.  The  Net^  and  Net^  are  also  so  divided. 
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The  subtraction  of  Net{-)  from  Net{+)  is  performed  electronically.  The 
thresholding  elements  incorporate  sigmoids  which  are  a  steep,  thus  performing  the 
operation 


Oj  =  1/(1  +  exp(a  X  -Netj/h))  =  1/(1  +  exp{-Netj)) 

when  a  =  h,  and  similarly  for  Ok-  The  constant  a  is  similar  to  h\  we  will  examine  it 
more  closely  below. 

Figure  7  shows  the  same  architecture,  now  performing  backward  propagation  of 
errors.  The  error  term  6k  is  computed  electronically  (lower  right  corner  of  Figure  7). 
Note  that  6k  is  bipolar  and  usually  small.  In  the  Phase  I  simulations,  the  largest  element 
never  exceeded  0.16.  To  better  utilize  the  full  range  of  the  1-D  SLMs,  we  multiply  6  k 
by  a  constant  b.  Note  that  Syb  is  assigned  to  two  SLMs;  in  one,  the  values  of  S^b  are 
sign  encoded  with  (+)  and  (-)  subelements.  This  SLM  is  to  be  used  for  computing  the 
outer  product  as  in  Equation  5  of  the  governing  equations.  The  outer  product  between 
this  and  the  unipolar  oj  results  in  a  bipolar  AlFkj,  in  which  the  elements  are  divided 
horizontally  into  (+)  and  (-)  subelements.  This  is  compatible  with  the  encoding  format 
of  IFkj;  hence,  the  weight  update  process  is  simplified. 

The  other  SLM  to  which  S^b  is  assigned  is  for  calculating  the  vector  matrix 
multiplication  inside  Equation  4.  Since  both  6k  and  the  weight  matrix  are  bipolar, 
this  computation  is  more  difficult  to  implement  than  the  forward  pass.  This  requires 
dividing  the  elements  vertically  into  (+)  and  (-)  subelements  as  shown.  As  depicted 
by  the  gray  arrow  in  Figure  /,  the  operation 

M"kj'^//i  X  6k6 

occurs  in  two  passes,  one  for  each  sign  of  6k-  In  the  first  pass,  6k(+)  is  presented;  each 
element  of  the  receiving  detector  will  evaluate  two  terms,  positive  and  negative.  In  the 
second  pass,  when  6k(-)  is  presented,  the  formerly  positive  subelement  now  receives  a 
negative  term 
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Figure  7.  The  proposed  multilayer  perceptron  architecture 
error  propagation. 


iS^bi-)  X  Wi:iVhi+)) 


—and  vice  versa.  The  receiving  detector  has  associated  electronic  circuitry  to  handle  this 
sign  interchange. 

When  this  vector  matrix  multiplication  is  complete,  fij  (from  the  rest  of  Equation 
4)  can  be  computed. 

With  and  computed,  all  that  remains  is  two  outer  products  (Equations  S  and 
6).  These  use  the  light  paths  shown  in  black  arrows.  The  receiving  detectors  are  CCDs 
and  perform  accumulation  over  the  epoch  by  time  integration.  Here  the  operation 
multiplies  bipolar-by-unipolar,  as  in  the  forward  passes. 

At  the  end  of  the  epoch,  the  accumulated  AWji  and  AlTkj  contain  (+)  and  (-) 
terms,  all  nonzero.  For  each  weight,  the  difference  between  its  (+)  and  (-)  terms  is 
computed;  then  that  subelement  of  the  proper  sign  is  set  to  that  value,  the  other  being 
set  to  zero.  Finally,  all  the  changes  are  to  be  added  to  the  weight  matrices  themselves, 
electronically.  The  reason  these  2-D  operations  are  allowed  to  be  electronic  is  that  they 
occur  only  once  per  epoch  and  therefore  do  not  represent  a  bottleneck. 

The  Phase  II  simulations  incorporated  opto-electronic  noise  caused  by  the  various 
transductions  in  the  architecture.  In  addition  to  random  noise  such  as  shot  and  thermal 
noise,  the  effects  of  fixed  pattern  noise,  as  from  nonuniformity  and  limited  contrast 
ratio,  were  included.  In  the  next  section  we  take  up  the  mathematical  nature  of  these 
imperfections  and  demonstrate  their  impact  on  the  convergence  properties  of  the 
algorithm. 

B.  Individual  Imperfections 

A  complete  simulation  accounts  for  all  elements  that  may,  according  to  good 
engineering  judgement,  present  a  potential  impediment  to  the  normal  convergent 
behavior  of  the  algorithm.  The  system  described  above  consists  of  four  subsystems:  the 
sources,  the  SLMs,  the  imaging  optics,  and  the  detectors.  Of  these,  only  the  source 
imperfections  have  been  left  out  of  our  work.  Since  the  input  vectors  to  all  operations 
are  themselves  implemented  using  row  SLMs,  the  entire  system  could  be  illuminated 
with  a  common,  external,  ideal  laser  source.  Compared  to  the  imperfections  present  in 
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the  other  three  subsystems,  we  believe  that  the  laser  source  does  indeed  behave  ideally, 
emitting  an  essentially  constant,  collimated  beam. 

For  the  body  of  simulations  we  designed  a  relatively  simple  problem  we  refer  to 
as  2-D  skewed  corners.  The  problem  requires  two  input  nodes,  one  output  node,  and 
four  hidden  units  to  solve.  Figure  8  shows  the  25  training  data  used  in  the  training. 
The  figure  also  shows  a  typical  back  propagation-induced  solution,  in  the  form  of  a 
contour  plot.  We  used  Gilbert’s  method  to  speed  convergence;  the  steepness  constant  was 
set  to  4.  We  also  chose  ij  to  be  0.45;  this  was  the  largest  observed  value  that  was  not 
prone  to  inducing  oscillation.  We  set  the  magnitude  range  a  at  20;  the  simluation 
automatically  increases  a  to  make  up  the  loss  of  dynamic  range  that  occurs  whenever  a 
bias  is  used  to  smooth  the  uncompensated  weight  transfer  functions.  As  for  the  constant 
b,  we  observed  in  a  trial  run  that  the  5k  and  5j  terms  reached  as  high  as  0.3.  We 
therefore  decided  that  it  was  safer  simply  to  leave  b  at  unity. 


Figure  8.  The  training  set  used  for  most  of  the  simulations.  Also  shown  is  a 
contour  plot  for  a  converged  network,  with  output  values  at  0.3,  0.5,  and 
0.7.  The  dotted  lines  in  this  plot  represent  decision  boundaries  set  by 
each  hidden  unit. 
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We  chose  four  representative  sets  of  initial  conditions.  Using  straight  back 
propagation,  the  four  representative  cases  take  1010,  1400,  1810,  and  1950  (rounded  to 
the  nearest  10)  epochs  (or  cycles)  to  solve,  respectively.  Their  learning  curves  are  shown 
in  Figure  9.  Figure  10  shows  the  same  learning  curves  on  individual  pairs  of  axes, 
superimposed  with  plots  of  the  number  misclassified.  (Strictly  speaking,  there  is  no 
relationship  between  the  units  of  mean-square  error  and  the  number  misclassified, 
despite  their  being  shown  on  the  same  ordinate.)  Unless  otherwise  stated,  w^e  let  each 
trial  run  to  N  =  3000  epochs,  regardless  of  convergence.  We  also  set  the  simulation  to 
record  its  progress  every  DNth  epoch,  with  DN  -  10,  to  save  memory. 

At  this  point  some  remarks  concerning  determination  of  time  to  convergence  are 
in  order.  As  the  simulation  runs,  it  creates  a  long  row  vector  called  Idetmse  (an 
abbreviation  for  “less  detailed  mean-square  error”).  Every  DNth  epoch,  the  program 
appends  to  Idetmse  an  element  which  is  the  mean-square  error  averaged  over  the  last  DN 
epochs.  It  also  at  that  time  appends  to  another  long  row  vector,  noff,  which  is  the 
number  of  training  pairs  misclassified  at  that  time  (not  an  average,  unlike  Idetmse). 
Generally,  the  time  to  convergence  is  determined  by  finding  the  largest  n  value  for 
which  noff  is  nonzero.  In  Matlab,  this  is  done  as  follows: 

x=0;0N!n;  m8X<x.*<noff-s0)) 

For  simulations  without  temporal  noise,  this  metric  is  adequate.  However, 
temporal  noise  in  any  element  usually  manifests  itself  as  spikes  in  graphs  of  both 
Idetmse  and  noff.  Consider,  for  example,  the  plot  of  noff  shown  in  Figure  11.  In  this 
case,  noff  remains  spiked  until  it  reaches  zero  at  about  n  -  2400,  and  then  stays  at  zero 
with  only  an  occasional  spike  to  one  or  two  at  some  higher  values.  The  above  metric 
would  then  yield  that  convergence  had  been  reached  at  n  2850,  which  happened  to  be 
the  last  spike  generated.  This  is  not  useful;  our  metric  should  yield  an  answer  close  to 
2400  for  us  to  compare  performance  between  this  and  other  simulation  runs.  Our 
solution  is  to  divide  noff  into  groups  of  ten  elements  (100  cycles  each  for  DN  =  10), 
take  the  average  over  the  ten  elements,  and  find  the  largest  n  for  which  this  number  is 
above  0.2.  In  Matlab,  the  coding  reads,  for  n  *  3000, 

xb»100:100:3000; 

niax(xb.*(fflean(reshape(noff(2:301},10,30»>0.2}} 
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Figure  9.  Four  representative  learning  curves  generated  by  back  propagation  on  the 
training  data  of  Figure  7. 


number  of  training  inputs  misclassified 


number  misclassified 


Figure  11.  A  typical  learning  curve  corrupted  by  temporal  noise,  in  this  case  in  the 
SLM  electronics.  The  averaging  method  allows  approximation  of 
convergence  time  by  eliminating  transients  that  occur  after  convergence. 


We  have  found  by  examination  of  graphs  of  noff  that  this  empirical  method  works  well, 
in  most  cases.  In  some  cases,  the  spikes  are  particularly  large,  and  so  0.3  must  be  used 
in  place  of  0.2.  Figure  11  shows  the  averages  with  respect  to  epoch.  In  those  cases 
where  this  yields  3000,  the  simulation  time  was  too  short  for  convergence  to  occur 
(assuming  it  would  have).  In  these  instances,  the  next  best  measure  is  the  average  noff 
over  the  last  100  epochs  (ten  values): 


tnean(x  .*(  noff  ( 292 : 301 )  ) 

Initially,  we  considered  each  of  the  various  imperfections  by  itself.  Then  we 
began  combining  key  effects  that  seemed  likely  to  be  the  most  lethal  combinations. 
Finally,  we  looked  at  the  presence  of  all  imperfections  simultaneously. 

1.  The  SLMs 

Figure  12  shows  a  summary  of  the  SLM  imperfections.  In  this  section  we  will 
expand  upon  the  meanings  of  these  imperfections  and  describe  their  effects  on 
convergence. 


a.  The  Effect  of  Malus’s  Law 

The  input  and  hidden  vectors,  as  well  as  the  weights,  biases,  and  error  terms  5k 
and  5j  are  all  represented  as  the  transmittances  of  SLMs.  We  have  modeled  them  as 
Pockel’s-effect  devices,  built  around  materials  that  exhibit  a  birefringence  whose  value 
is  proportional  to  the  applied  electric  field.  The  Pockel’s  cell  has  a  known  thickness  and 
is  placed  between  crossed  polarizers.  Polarized  light  passes  through  the  cell  and,  as  the 
voltage  increases  from  zero,  the  output  polarization  gradually  becomes  elliptical,  then 
circular,  then  elliptical  with  the  major  axis  orthogonal  to  the  input  polarization,  and 
finally  linear  orthogonal.  Further  increases  in  voltage,  though  avoided,  begin  to  reverse 
this  trend. 

The  analyzer  is  orthogonal  to  the  input  polarizer,  so  that  with  no  voltage  applied, 
the  orthogonal  polarization  component  is  minimal  and  the  transmittance  is  zero.  In 
operation,  then,  the  modulator’s  transmittance,  what  we  call  the  optical  weight  oW,  is 
given  by 


oW  -  smHiir/2)eW), 
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Multiplicative 
space-vaiiant  gain 


Nonlinearity 
compensation 
via  a  bias 


Figure  12.  A  model  of  the  SLM  element  and  its  driving  electronics. 

where  eff'  is  the  electrical  value  of  the  weight,  with  0  <  elV  <  1.  (The  value  elV  can  be 
thought  of  as  being  propagated  within  the  “driving  electronics”  box  in  Figure  12.)  This 
relationship  is  an  expression  of  Malus’s  law,  and  it  immediately  raises  an  important  issue: 
since  the  transmittance  is  nonlinearly  related  to  the  applied  voltage,  is  it  necessary  or 
desirable  to  compensate  for  this  effect  using  electronic  predistortion? 


Let  esW  be  the  value  in  the  electronic  storage  register  where  a  weight  or  input  is 
stored  (“electronic  storage  weight”  in  Figure  12).  The  predistortion  is  expressed: 

elT  =  (2/ir)sin'^(esH'^). 
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This  is  shown  in  Figure  12,  near  the  bottom  left.  We  hold  that  while  introducing  such 
electronic  predistortion  into  the  1-D  SLMs  is  certainly  feasible,  having  it  in  the  2-D 
SLMs  may  require  undue  electronic  complexity.  So,  one  set  of  our  simulations  is 
devoted  to  determining  the  effects  of  a  nonlinear  mapping  between  esIV  and  otV. 

Figure  13  shows  the  result  of  this  effect.  A  sign-encoded  pair  of  2-D  SLM 
elements  passes  light  onto  the  associated  detector  elements.  In  each  SLM  element,  the 
applied  voltage  is  proportional  to  the  magnitude.  The  amount  of  light  detected,  then,  is 
proportional  to  the  square  of  the  sine  of  the  magnitude,  as  shown  in  the  curves.  When 
(+)  and  (-)  subelements  are  summed,  the  resulting  transfer  function  exhibits  three 
relatively  flat  regions:  one  near  zero  weight  and  two  at  the  extreme  (+)  and  (-)  weight 
values. 


Figure  13.  The  effect  of  subelement  sign  encoding  combined  with  the  sine-squared 
nonlinearity. 

Figure  14  shows  a  method  to  smooth  out  this  transfer  function.  Instead  of 
predistortion,  simple  biasing  electronics  reduce  the  range  of  input  voltages,  to  avoid 
generation  of  subelement  outputs  in  the  flat  regions.  The  price  paid  is,  of  course,  a 
reduction  in  dynamic  range,  that  is,  an  increase  in  sensitivity  to  stray  voltages.  Note 
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that  this  reduction  has  no  apparent  cost  until  noise  in  the  driving  electronics  is  included. 
If  the  normalized  bias  is  t  (tp2  in  the  Matlab  simulations)  the  electronic  register  holds 

eW  =  esW(l-2t)  +  t. 


Figure  14.  Reducing  the  transfer  function  nonlinearity  by  use  of  a  bias. 

This  is  shown  graphically  in  Figure  12,  near  the  bottom  right.  As  shown  in  Figure  14,  a 
renormalizing  constant  is  necessary  in  the  post-detection  electronics.  This  effects  a 
stretch  in  the  vertical  direction  of  the  transfer  function,  preserving  the  integrity  of  the 
algorithm.  The  renormalizing  constant  is  just 

1  /[sin2((T/2)(  1  -/))-sin2((ir/2)/)]. 

Recall  the  variable  a  in  Figures  5  and  6;  it  is  the  product  of  the  renormalizing  constant 
and  h.  Note  that  if  /  =  0  (no  bias  is  used)  or  if  predistortion  is  used,  a  =  /«. 

In  our  simulations,  we  examined  the  effect  of  the  nonlinearity  without  the  bias 
and  with  different  bias  values.  Figure  15  gives  an  idea  of  how  much  of  a  bias  t 
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normalized  optical  weight 


Figure  15. 


The  SLM  nonlinear  transfer  function  for  several  biases. 


Recall  that  in  the  1-D  SLMs,  we  have  opted  to  retain  predistortion  compensation 
for  the  sine-squared  nonlinearity. 

For  the  four  sets  of  initial  conditions,  we  found  that  the  nonlinearity  has  a 
detrimental  effect  on  convergence  time;  nevertheless  the  system  does  eventually 
converge,  sometimes.  Introducing  a  bias  significantly  reduces  the  detrimental  effect. 
Our  biases  are  in  the  normalized  regime,  where  a  weight  element  can  be  at  most  one. 
Generally,  as  the  bias  approaches  its  limit  of  O.S,  the  time  to  convergence  approaches 
what  it  was  in  the  case  of  predistortion  compensation.  Table  I  shows  the  time  to 
convergence  for  each  case. 


TABLE  I 

Simulation  results  with  and  without  compensation  for  Malus’s  law. 


time  to  convergence 


I.c.t 

zero 

0.05 

0.1 

0.15 

compensated  via  bias 
0.2  0.25  0.3 

0.35 

0.4 

0.45 

pure 

BEP 

I 

1870 

1390 

1240 

1160 

1100 

1070 

1040 

1030  1 

1020 

1010 

1010 

II 

fstuckl 

1740 

1530 

1520 

1580 

1760 

1750 

1600 

1420 

1390 

1400 

III 

>3000 

2130 

1800 

1720 

1700 

1710 

1730 

1760 

1780 

1800 

1810 

IV 

2300 

1780 

1720 

1800 

1850 

1890 

1930  - 

1950 

1960 

1950 

1950 

ficseed  values  for  1, 11,  III,  and  IV  are  419043,  853252,  628191,  and  107470,  respectively. 

b.  Temporal  Noise 

The  analog  electronics  driving  the  SLMs  may  exhibit  temporal  noise.  This  is 
modelled  by  additive  Gaussian  noise  as  shown  in  Figure  12,  and  is  expressed  in  the 
equation 


efV  •<=  efV  +  n 


where  n  is  a  random  number  with  a  normal  probability  distribution  having  ft  =  0  and  a 
specific  variance  a®. 

It  is  important  to  note  that  this  noise  is  added  to  the  applied  voltage  rather  than 
to  the  weight  or  input  itself,  whether  or  not  it  is  preceded  by  predistortion;  that  is,  it  is 
applied  to  eW.  Since  the  sine-squared  nonlinearity  is  then  applied  to  eW,  a  given  noise 
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spike  will  have  a  greater  effect  for  a  weight  or  element  near  the  middle  (efV  =  0.5)  than 
at  the  range  edges  (eW  =  0  or  1). 

As  we  might  expect,  temporal  noise  gives  a  learning  curve  of  mean-jquare  error 
a  spiked,  jagged  aspect.  Figure  16  shows  four  learning  curves,  one  from  a  noise-free 
trial,  and  three  with  different  noise  variances,  but  sharing  the  same  random  number 
generator  seed.  The  effect  of  variance  is  to  govern  the  size  of  the  spikes  as  well  as  the 
degree  to  which  the  mean-square  error  approaches  zero. 

Table  II  shows  the  simulation  results.  The  times  to  convergence  are  computed 
using  the  averaging  metric  referred  to  above.  To  ensure  that  our  results  are  not  too 
much  an  artifact  of  any  one  noise  history,  three  different  random  number  generator 
seeds  were  used. 


TABLE  II 

Simulation  results  with  temporal  noise  in  the  SLM  driving  electonics.  Note  that,  except 
for  the  case  of  0  noise,  values  are  rounded  to  the  nearest  100  by  our  averaging 

algorithm. 


I.C. 

1  .  .  J 

training 

seedf 

time  to  convergence. 

(noff|n= 

2900: 10:301 

BEP 

0.00125 

noise  variance 
0.0025  0.005 

0.0087 

0.015 

I 

a 

1010 

1100 

1500 

1600 

2800 

(0.7) 

b 

1100 

1300 

2000 

(0.3) 

(1.4) 

c 

1100 

1400 

1800 

2800 

(0.6) 

II 

a 

1400 

1700 

1900 

2100 

2800 

(0.8) 

b 

1700 

1900 

2500 

2900 

(2.8) 

c 

1700 

2100 

2300 

2800 

(1.3) 

III 

a 

1810 

1900 

2000 

2400 

2900 

(4.8) 

b 

2000 

2000 

2400 

(0.4) 

(5) 

c 

1900 

2100 

2300 

2900 

(4.5) 

IV 

a 

1950 

2400 

2700 

(0.3) 

(2.5) 

(3.5) 

b 

1500 

2200 

2100 

2900 

(1.6) 

c 

2400 

1700 

2400 

(0.8) 

(1) 

ttneed  val-jei  for  a,  b,  and  c  are  89S820,  205464,  and  490001,  respectively. 
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1,  trseed=  893820 


Learning  curves  for  simulations  with  temporal  noise  in  the  SLM  driving 
electronics. 


c.  Space-variant  Gain 

A  given  SLM  may  not  be  completeiy  uniform  across  all  the  elements.  This 
nonuniformity  is  most  likely  manifest  as  variation  in  thickness  or  electro-optic  response. 
We  model  this  as  a  set  of  element-dependent  multipliers  to  the  applied  voltages.  This  set 
consists  of  random  numbers  s  whose  mean  is  one  and  which  are  normally  distributed. 
That  is. 


eW  <=  eW  X  s. 

Figure  12  illustrates  this.  Note  that  if  a  multiplier  exceeds  one,  then  the  SLM  element 
may  induce  a  birefringent  phase  shift  greater  than  t/2,  creating  a  decrease  in  the 
weight.  Therefore,  with  this  nonuniformity,  the  transfer  function  for  a  particular 
element  may  not  be  monotonic. 

An  element-by-element  nonuniformity  in  the  device  electro-optic  response  can 
be  modelled  by  establishing  a  different  random  number  multiplier  for  each  element. 
The  set  of  random  numbers  follows  a  Gaussian  distribution  with  a  mean  of  one  and  a 
known  variance. 

With  reference  to  Figure  7,  the  SLMs  encode  Oj,  DWji,  oj,  DWju  5k  (unipolar),  5k 
(bipolar),  and  5j.  For  our  problem,  with  /  =  2,  7  =  4,  and  =  1,  the  architecture  then 
possesses  only  fifty-three  elements.  In  the  regime  of  statistics,  this  is  not  a  large 
number.  The  point  is,  it  is  particularly  easy  to  produce  a  “bad  seed”  phenomenon,  in 
the  simulations.  That  is,  though  the  variance  of  the  multipliers  may  be  small,  the  more 
critical  elements  happen  to  have  the  worst  deviations— or  vice  versa.  To  allow  for  this 
possibility,  we  performed  these  simulations  using  four  different  seeds  for  the  random 
number  generator  which  produces  the  multipliers. 

Table  III  shows  the  results.  Judging  by  the  low  variances,  it  takes  only  a  small 
degree  of  electro-optic  nonuniformity  to  render  the  architecture  unfit  for  use.  While  it 
may  appear  that  in  many  instances,  convergence  time  is  being  reduced,  the  point  is  that 
it  is  being  drastically  changed,  and  that  a  different  problem  of  the  same  size  may  very 
well  get  stuck. 
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TABLE  III 

Simulation  results  with  nonuniformity  in  the  SLM  electro-optic  response. 


I.C. 

seedf 

j 

time  to  convereence 

1  ! 

'  BEP  ] 

variance  of  element  response  multipliers 

1  0.003126  0.00626  0.0125  0.025  0.05 

I 

a 

'i 

1010 

1000 

990 

990 

[stuck]  [stuck] 

b 

1010 

1020 

1030 

1110 

1450 

c 

1020 

1030 

1050 

1280 

[stuck] 

d 

1 

1010 

1020 

1030 

1090 

2490 

II 

a 

1400 

1490 

2000 

2100 

2420 

[stuck] 

b  ! 

1720 

1420 

1380 

1440 

1690 

c 

1460 

1630 

1870 

1790 

2980 

d 

1560 

1810 

1390 

1870 

1780 

III 

a 

1810 

1810 

1940 

2850 

[stuck] 

[stuck] 

b 

1920 

2090 

>3000 

[stuck]  [stuck] 

c 

1750 

1830 

>3000 

>3000 

[stuck] 

d 

1810 

1850 

>3000 

[stuck]  1770 

IV 

a 

1950 

1270 

1530 

1940 

2020 

[stuck] 

b  i 

1940 

1250 

1620 

1590 

[stuck] 

c 

1660 

1730 

2020 

2400 

[stuck] 

d 

2540 

2380 

2520 

>3000 

[stuck] 

fivieed  values  for  a,  b,  c,  and  d  arc  698392,  774772,  200493,  and  920146,  respectively. 


d.  Finite  Extinction  Ratio 


The  modulator  devices  are  placed  between  crossed  polarizers,  themselves  a 
possible  source  of  error.  The  error  occurs  when  the  polarizers’  extinction  ratio  is  finite, 
as  depicted  in  Figure  12.  If  V*  is  the  reciprocal  of  the  extinction  ratio,  the  modulator’s 
transmittance  is  given  by 


oW  =  (l-^)sin2((jr/2)eir)  +  V- 


In  our  design,  all  elements  of  a  given  SLM  share  the  same  polarizer  and  analyzer. 
Regardless  of  which  has  the  finite  extinction  ratio,  the  result  should  be  pretty  much  the 
same.  For  simplicity,  we  assumed  that  the  same  quality  of  polarizer  is  used  throughout, 
and  subject  all  operations  to  an  overall  extinction  ratio.  For  example,  for  an  extinction 
ratio  of  100,  all  SLM  elements  set  to  zero  really  pass  0.01;  those  set  to  one  pass  one,  and 
the  entire  curve  is  adjusted  for  the  intermediate  values. 
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Initially  we  tried  a  broad  range  of  extinction  ratio  Values,  using  just  I.C.  II. 
Table  IV  shows  these  preliminary  results.  Then,  using  all  four  sets  of  I.C.s,  we 
narrowed  our  range  of  interest  to  that  shown  in  Table  V. 


TABLE  IV 

Preliminary  results  with  finite  extinction  ratio  in  the  SLMs,  using  I.C.  II. 


extinction  ratio 

time  to  convergence 

oo  (pure  BEP) 

1 

1400 

2154 

1400 

1000 

1400 

464.2 

1400 

215.4 

1420 

100 

1480 

46.42 

1660 

25.14 

1800 

10 

2040 
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TABLE  V 

More  detailed  results  with  finite  extinction  ratio  in  the  SLMs. 


time  to  convergence 
extinction  ratio 


I.C. 

OO 

100 

56.23 

31.62 

17.18 

10 

5.623 

3.162 

1.778 

I 

1010 

1060 

1110 

1190 

1360 

1730 

2860 

>3000 

>3000t 

II 

:  1400 

1480 

1590 

2090 

1830 

2040 

2990 

>3000 

>3000t 

III 

1  1810 

1830 

1860 

1920 

2040 

2360 

>3000 

>3000 

>3000t 

IV 

1950 

1990 

2030 

2090 

2260 

2230 

>3000 

>3000 

>3000t 

tit  is  uncertain  whether  these  trials  were  on  their  way  to  convergence. 


Apparently,  the  system  is  very  tolerant  of  finite  extinction  ratio,  showing 
convergence  even  for  values  less  than  10.  For  the  most  part,  the  poorer  extinction  ratio 
seems  simply  to  decrease  the  effective  learning  rate. 


2.  The  Optical  Imaging  System 

The  optical  imaging  system  consists  of  that  set  of  lenses  and  polarizing 
beamsplitters  which  image  one  device  plane  onto  another.  In  a  given  operation,  such  as 
a  vector-matrix  multiplication,  two  optical  imaging  systems  are  employed:  one  imaging 
the  inputs  onto  the  weight  matrix,  the  other  imaging  the  weight  matrix  onto  a  detector 
array.  The  chief  mechanism  by  which  optical  imaging  systems  introduce  error  is 
crosstalk. 


Figure  17.  Crosstalk  in  an  optical  vector-matrix-multiplier. 
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The  process  of  crosstalk  is  illustrated  in  Figure  17.  Each  input  element 
broadcasts  to  a  weight  SLM  row.  Some  of  the  light,  say  90%,  reaches  the  appropriate 
row.  The  other  10%  is  divided  between  the  two  adjacent  rows.  Similarly,  each  column 
is  optically  summed  onto  one  detector  element.  90%  of  the  light  reaches  that  element; 
the  other  10%  goes  to  the  elements  adjacent  to  it.  Note  that  this  means  some  of  the 
light  designated  for  Oj(+)  adds  to  Oj(-)  and  oj-i(+). 

When  we  first  proposed  this  architecture,  we  considered  two  encoding  layouts. 
In  one,  each  individual  weight  is  divided  into  two  side-by-side  subelements,  namely  the 
(+)  and  (-)  subelements.  This  is  the  approach  shown  in  Figure  17.  Another  layout 
would  consist  of  two  noninterlaced  submatrices,  namely  the  (+)  submatrix  and  the  (-) 
one.  We  embraced  the  former  approach  based  on  a  series  of  trial  vector-matrix 
multplies  using  both  approaches.  We  found  the  former  apprach,  that  of  side-by-side 
subelements,  to  be  more  crosstalk  tolerant  in  the  final  answer. 

As  can  be  seen  in  Table  VI,  the  alorithm  is  sensitive  to  even  small  amounts  of 
crosstalk.  Differences  in  convergence  time  result  from  as  little  as  a  tenth  of  a  percent 
crosstalk.  Curiously,  the  convergence  time  may  decrease.  At  1%  crosstalk,  convergence 
times  are  significantly  changed,  but  the  algorithm  does  converge.  At  6%,  the  network 
performance  is  severely  corrupted. 


TABLE  VI 

The  effects  of  crosstalk  upon  convergence. 


I.C. 

time  to  convergence. 

(noffln: 

=2900:10:3000) 

0% 

0.05% 

0.1% 

0.2% 

0.4% 

crosstalk 

0.6%  1% 

1.4% 

2.8% 

6% 

12% 

■  ■  1 

I 

1010 

1020 

1030 

1040 

1080 

1110 

1180 

1260 

1620 

(4) 

[stuck] 

II 

1400 

1400 

1410 

1440 

1500 

1580 

1820 

2350 

2120 

(4) 

[stuck] 

III 

1810 

1810 

1820 

1830 

1850 

1870 

1940 

2030 

2380 

2360 

(6) 

IV  ; 

1950 

1940 

1930 

1910 

1870 

1830 

1990 

1820 

1880 

(3) 

(10) 

Figure  18  depicts  some  examples  of  learning  curves  for  a  system  with  crosstalk. 
It  is  interesting  to  note  that,  for  this  particular  initial  condition  set  (as  well  as  others), 
increased  crosstalk  appears  to  shift  the  curves  to  the  right.  This  suggests  that  crosstalk 
has  its  greatest  effect  in  the  beginning  stages  of  learning. 
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cycle 

Figure  18.  Learning  curves  for  simulations  with  crosstalk. 

3.  The  Deteci''»-s 

We  have  incorporated  into  the  architecture  two  varieties  of  detector  arrays.  The 
three  linear  arrays  that  compute  oj,  ou,  and  iklFkj  are  PIN  detector  arrays.  The  two 
2-D  arrays  are  CCD  arrays.  They  accumulate  charge  over  all  passes  in  an  epoch  and 
then  impart  this  charge  to  the  2-D  SLMs  after  electronic  amplification  and  other 
processing. 

Both  types  of  detector  exhibit  shot  noise  and  thermal  noise.  Shot  noise  is  signal- 
dependent;  thermal,  signal-independent.  The  derivations  below  concern  the  standard 
deviations  of  each  type  of  noise  in  each  type  of  detector.  They  show  how  to  express 
these  values  in  terms  of  the  algorithmic  units  and  as  a  function  of  maximum  signal-to- 
noise  ratios.  This  allows  one  to  simply  specify  these  ratios  rather  than  the  physical 
constants  (e.g.  operating  temperature)  and  then  run  the  simulations. 

a.  PIN  Shot  Noise 

The  shot  noise  manifests  itself  as  variations  in  the  light- induced  photocurrent. 
The  standard  deviation  is 
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«T.hot  =  (IqPSB)^ 

in  Amperes,  where  q  is  the  electronic  charge,  P  is  the  incident  power  in  Watts,  S  is  the 
sensitivity,  or  responsivity,  in  Amperes/Watt,  and  B  is  the  bandwidth  in  Hz.  Consider 
the  first-layer  forward  pass.  Let  /*n»«x  denote  the  power  when  all  SLM  elements  are 
transparent  (have  a  value  of  1).  Let  oNet^  be  the  normalized  value  assigned  to  one  of  the 
two  subelements  for  a  given  Netj  value.  Since  oNetj  can  be  at  most  /  +  1  (the  “1”  is  for 
0j),  we  can  recast  the  standard  deviation  as 

^*hot  ®  {29[0^^tj/(/  +  l)]Pinax5fi}^. 

The  signal-to-noise  ratio  (SNR)  corresponding  to  maximum  available  detected 
intensity  is 

SNRmax  “  Pm»xS/(2qP maxSB)^ . 

Using  this  equation,  we  can  substitute  for  PtnuxS  in  the  shot  noise  equation  to  yield 

a.hot  =  Pm^S[oNeti/(r  +  i)]Vsnr,mx 


in  Amperes.  In  terms  of  oNetj, 

®'fhot(oA/’e/j)  =  0(hot[(f  +  l)/(-Pmax"S)] 

-  [oNetjil  +  1)]^/SNR,„„. 

Note  that  when  oNetj  reaches  its  maximum,  /  +  1 ,  the  shot  noise  a  becomes 
(/  +  1)/SNR„„. 

For  our  simulations,  we  chose  a  broad  range  of  values  for  the  SNRmax-  We  also 
saw  the  need  for  only  one  training  seed,  unlike  the  in  other  temporal  noise  simulations. 
This  is  because  many  more  random  numbers  are  called  in  the  detection  equation  than  in 
the  weight  updates,  so  the  “bad  seed”  phenomenon  is  less  likely  to  occur. 
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The  results  of  the  simulation  are  shown  in  Table  VII.  The  SNRnux  threshold 
below  which  significant  changes  in  convergence  time  occurs  seems  to  be  around  21.  The 
value  below  which  convergence  is  prevented  completely  is  not  far  from  this. 


TABLE  VII 

Shot  noise  in  the  PIN  detectors.  Note  that,  except  for  the  case  of  oo,  values  are  rounded 
to  the  nearest  100  by  our  de-spiking  algorithm. 


I.C. 

time  to  convergence. 

(POffInr 

=2900;10; 

sooo) 

OO 

215.4 

100 

SNRmax 

46.42  21.54 

10 

4.462 

I 

1010 

1000 

1000 

1000 

1100 

(0.3) 

(5.4) 

n 

1400 

1400 

1400 

1500 

1700 

(0.5) 

(6.3) 

III 

1810 

1800 

1800 

1800 

1900 

(0.5) 

(8.1) 

IV 

1950 

2100 

2100 

2100 

2500 

(1.2) 

(7.6) 

b.  PIN  Thermal  Noise 

The  thermal  noise  is  a  signal-independent  function  of  such  parameters  as 
temperature: 

^thermal  = 

in  Amperes,  where  /cb  is  Boltzmann’s  constant,  T  is  the  temperature  in  Kelvins,  B  is  the 
bandwidth,  and  Rl  is  the  load  resistance  of  the  detector  electronics,  in  Ohms. 

The  SNR  corresponding  to  maximum  available  detected  intensity  is 

SNRmax  —  P  mKK.S/(Ak2irB/Ri^. 

Substitution  for  Ri}  in  the  thermal  noise  equation  yields 

(^thermal  “  PinBX*y/SNRniax 

in  Amperes.  In  terms  of  oNeti, 
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0^thermal(0^C^j)  —  0^th«rmal((I  +  i)/{PmaxS)) 


-  (/  +  1)/SNR„„. 


The  simulation  results,  given  in  Table  Vm,  point  to  a  somewhat  greater 
sensitivity  to  SNRmax  than  was  seen  for  shot  noise.  This  is  not  surprising;  note  that  the 
standard  deviation  for  thermal  noise  is  signal-independent,  actually  corresponding  to  that 
shot  noise  whose  signal  is  fixed  to  the  greatest  value,  /  +  1. 

TABLE  VIII 

Thermal  noise  in  the  PIN  detectors. 


time  to  convergence,  (noffln=2900:io:30oo) 


I.C. 

OC 

215.4 

100 

SNRmax 

46.42  21.54 

=zi^u:iu: 

10 

3UUU/ 

4.462 

I 

1010 

1000 

1000 

1400 

(0.8) 

(9.1) 

(8.8) 

11 

1400 

1400 

1500 

2000 

(0.9) 

(9.0) 

(8.8) 

III 

1810 

1800 

1800 

2100 

(5.3) 

(8.9) 

(8.8) 

IV 

1950 

2100 

2800 

2100 

(4.3) 

(9.1) 

(8.7) 

I 


c.  CCD  Shot  Noise 

The  CCD  noise  equation  derivations  proceed  similarly,  except  that  we  measure 
quantities  in  photoelectrons  accumulated,  rather  than  current.  The  CCD  noise 
calculations  are  used  once  at  the  end  of  each  epoch. 


The  standard  deviation  of  the  CCD  shot  noise  is 


Oiihot  =  CN^ 


in  photoelectrons  actually  released,  where  C  is  a  constant,  and  N  is  the  number  of 
photons  absorbed  (assuming  unity  quantum  efficiency).  Consider  the  first-layer  outer 
product.  Let  Nmax  denote  the  number  of  photons  that  the  detector  element  would 
absorb  in  all  R  passes  in  an  epoch  were  the  SLM  elements  transparent  (valued  at  1).  Let 
oDWji  be  the  normalized  value  assigned  to  one  of  the  two  subelements  for  a  given  DlTj, 
value.  The  most  oDW^  can  be,  then,  is  R.  We  then  recast  the  standard  deviation 

<^ihot  “  CI(oZ)lPji//?)Ninax]^- 
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The  SNR  corresponding  to  maximum  available  absorbed  photons  is 


SNRmax  =  Nni»x/[C(Nnuuc^)] 

Substitution  for  Nmax  in  the  shot  noise  equation  yields 

«T,hot  =  N,aM^oDWii/R)^ /SNR„^ 
in  photoelectrons.  In  terms  of  oDWyu 


<^»hotioDW ji)  =  Cahot(R/Ninuc) 

=  (oDlFj|R)*/SNR  max* 

The  simulation  results  are  given  in  Table  IX.  The  transition  from  the  noise  just 
making  a  difference  to  the  point  at  which  convergence  is  prevented  is  more  gradual  than 
it  was  for  the  PIN  shot  noise  case.  The  algorithm  thus  seems  more  tolerant  of  CCD  shot 
noise  than  PIN  shot  noise. 


TABLE  IX 

Shot  noise  in  the  CCD  detectors. 


I.C. 

time  to  convergence. 

(noffln: 

=2900:10:3000) 

OO 

215.4 

100 

SNRmxx 

46.42  21.54 

10 

4.462 

I 

1010 

1000 

1000 

1000 

1100 

1200 

(5.0) 

II 

1400 

1400 

1400 

1500 

1700 

2200 

(5.6) 

III 

1810 

1900 

1900 

1900 

2200 

1200 

(4.8) 

IV 

1950 

2000 

2500 

1800 

1900 

2000 

(5.0) 

d.  CCD  Thermal  Noise 

The  CCD  thermal  noise  is  independent  of  the  number  of  incident  photons,  and  is 
instead  a  function  of  an  arbitrary  number  n  of  photons; 


^thamwl 


Cn 


in  photoelectrons  actually  released,  where  C  is  a  constant. 
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The  SNR  corresponding  to  maximum  available  absorbed  photons  is 


SNRmax  “  ^inax/[C(n^)] 

Substitution  for  C  in  the  thermal  noise  equation  yields 

^thermal  ~  iVjnax/SNRmax 


in  photoelectrons.  In  terms  of  oDWji, 

—  (^max/SNR  max 

=  R/SNRmax* 

As  shown  in  Table  X,  the  algorithm  is  quite  sensitive  to  thermal  noise  in  the 
CCDs.  Generally,  convergence  is  impeded  for  SNRmax  values  less  than  100  or  so.  Note 
that  for  the  three  lowest  SNRma*  values,  the  number  misclassified  is  the  same  for  all 
four  sets  of  initial  conditions.  Indeed,  at  these  values,  the  learning  curves  are  practically 
identical  and  characterized  chiefly  by  the  noise. 


TABLE  X 

Thermal  noise  in  the  CCD  detectors. 


time  to  convergence. 

(noffinz 

=2900:10:3000) 

SNRmax 

I.C. 

oo 

215.4 

100 

46.42 

21.54 

10 

4.462 

I 

1010 

1100 

1800 

(9.8) 

(12.2) 

(13.0) 

(12.5) 

II 

1400 

1900 

(4.0) 

(10.2) 

(12.2) 

(13.0) 

(12.5) 

III 

1810 

1500 

1900 

(9.8) 

(12.2) 

(13.0) 

(12.5) 

IV 

1950 

1900 

(2.3) 

(lO.O) 

(12.2) 

(13.0) 

(12.5) 

e.  Realistic  Detection  Parameters 


For  most  of  the  imperfections,  crosstalk  for  example,  it  is  not  difficult  to  suggest 
realistic  expectations  of  what  a  given  system  might  be  capable  of.  At  this  point,  we 
shall  briefly  discuss  the  PIN  detectors. 


Recall  the  PIN  shot  noise  derivation:  along  the  way  we  found 
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SNRmax  =  PaiaxS/i2qPia»xSB)^. 

The  electronic  charge  q  =  1.6xl0‘^®  Coulombs.  Let  the  bandwidth  B  =  10^/s  and  the 
responsivity  5-1  Ampere/Watt.  For  each  detector  element,  assume  the  maximum  light 
power  availability  to  be  Pmax  ®  10'®  Watts.  This  gives  SNRnukx(shot)  si  1766. 

In  the  PIN  thermal  noise  derivation,  we  found 

SNRmax  =  Pm^S/iAksTB/RL)^ 

where  ks  =  l.SSxlO"^®  J/K,  and  we  let  T  =  300K.  Rl  is  the  load  resistance,  which  is 
typically  lOOfl.  SNRinax( thermal)  246. 

C.  Combinations  of  Key  Effects 

While  two  or  more  given  effects  may  by  themselves  be  small  enough  to  little 
affect  the  network’s  convergence  property,  they  may  combine  much  more  lethally— or 
cancel  one  another.  Our  approach  for  any  combination  is  to  use  two  sets  of  values.  The 
“minimal”  set  corresponds  to  those  values  which  by  themselves  had  little  effect.  The 
“maximal”  set  are  those  values  which  by  themselves  were  enough  to  delay  convergence 
significantly,  but  not  enough  to  altogether  prevent  it  from  occuring  within  the  normal 
number  of  cycles. 

1.  Temporal  Noise  and  Maius’s  Law  in  the  Weights 

As  we  discussed  in  Section  IV.  B.  I.  a.,  Maius’s  Law  effects  a  nonlinearity  in  the 
weight  transfer  function.  Short  of  predistortion,  this  nonlinearity  may  be  reduced  using 
a  bias,  at  the  expense  of  dynamic  range.  We  should  expect  the  loss  in  dynamic  range  to 
appear  as  an  increase  in  the  effective  temporal  noise.  (This  increase  is  due  to  the 
renormalizing  constant  which  is  necessary  to  preserve  the  integrity  of  the  algorithm.) 

Furthermore,  we  wished  to  test  the  following  hypothesis:  that  there  should  be  a 
“sweet  spot”  in  the  bias  for  a  given,  low,  temporal  noise  value.  That  is,  we  can  find  a 
bias  which  is  large  enough  to  help  overcome  the  flat  regions  of  the  nonlinearity  but 
small  enough  not  to  significantly  increase  the  effective  temporal  noise. 


47 


To  see  this  effect,  we  used  the  very  small  temporal'  noise  variances  of  O.OOS, 
0.0025  and  0.00 125.  Table  XI  shows  the  results;  Figure  19  shows  them  graphically. 
Indeed,  we  can  observe  the  existence  of  a  bias  “sweet  spot,”  whose  value  increases  with 
decreasing  SLM  noise.  That  is,  as  SLM  noise  is  reduced,  larger  biases  can  be  used. 


TABLE  XI 

The  combining  of  the  Malus’s  Law  nonlinearity  with  SLM  temporal  noise.t 


I  time  to  convergence,  (noff|a=29oo:io:80oo) 


bias 

=2wu:iu: 

variance 

0 

0.1 

0.2 

0.3 

0.4 

0.005 

2000 

2000 

(0.3) 

(1.2) 

(4.3) 

0.0025 

2000 

1500 

1500 

2100 

(0.5) 

0.00125 

1900 

1500 

1400 

1400 

2100 

ti  e.  I,  tn«cd=206464. 

3000 

«  2500 

c 
u 

s? 

I  2000 

u 

o 

I  1500 
1000 

0  0.1  0.2  0.3  0.4 

bias 

Figure  19.  The  existence  of  a  “sweet  spot”  in  bias,  whereat  the  bias’s  benefit  and 
harm  are  in  balance.  The  solid  curve  represents  a  noise  variance  of  0.005; 
the  dashed  curve,  0.0025;  the  dotted  curve,  0.00125. 

2.  Crosstalk  and  Shot  Noise 


J=4,  Nonlin.  &  Temporal  SLM  Noise 

- 1 - 1 - 1 - 


J _ I _ L 


As  a  comparison  between  Tables  XII  and  VI  shows,  the  inclusion  of  shot  noise 
(in  both  the  PIN  and  CCD  detectors)  compounded  with  the  effect  with  crosstalk. 
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increases— usually— the  convergence  time.  Table  XXII  shows  'simulation  results  for  two 
sets  of  crosstalk  and  shot  noise  values.  Note  that  we  did  not  use  the  averaging  method 
to  obtain  convergence  time,  even  though  temporal  noise  was  present— the  plots  of  noff 
appeared  to  be  free  of  the  spikes  that  had  necessitated  the  method  for  SLM  temporal 
noise  data. 


TABLE  XII 

The  combining  of  crosstalk  and  shot  noise.f 


I.C. 

BEP 

time  to  convergence 
minimall 

maximalft 

I 

1010 

1040 

1440 

II 

1400 

1440 

2480 

III 

1810 

1820 

1840 

IV 

1950 

2120 

2550 

|tn«cd=893820 

tO.2%  croutalk,  SNRmax  =  tOO  for  both  PIN  ihot  and  CCD  shot  noiie. 
ttl%  croutalk,  SNRmax  =  21.64. 


3.  The  Complete  Detector  Noise  Models 


Generally,  the  effects  of  shot  noise  and  thermal  noise  in  all  elements  of  detection 
do  accumulate.  However,  as  the  data  in  Table  XIII  show,  the  different  effects  do  not 
appear  to  form  a  lethal  combination. 


TABLE  XIII 

The  complete  detector  noise  models.! 


I.C. 

BEP 

time  to  convergence 
minimall 

maximalft 

I 

1010 

1100 

1900 

II 

1400 

1600 

2000 

III 

1810 

2000 

2500 

.Vj 

1950 

1300 

2400 

ttrae«d=89S820 

|SNRmax(PIN  and  CCD  shot)  =  100,  SNRmax(PIN  thermal)  =  215.4,  SNRmax(CCD  thermal)  =  464.2. 
ttSNRmax(PIN  ehot)  =  21.64,  SNRmax(PIN  thermal)  =  46.42,  SNRmax(CCD  ahot)  =  10,  SNRmax(CCD 
thermal)  =  216.4. 
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I 

I 

I 

I 

I 

4.  The  Complete,  Compensated  SLM  Noise  Model  - 

We  have  never  ruled  out  the  possibility  of  using  electronic  predistortion  to 
compensate  for  the  SLM  nonlinearity.  As  we  mentioned  in  Section  IV.  B.  1.,  it  may 
merely  introduce  undue  complexity  into  the  2-D  SLMs.  Recall,  too,  that  in  the  1-D 
SLMs,  we  always  incorporate  this  predistortion. 

Since  the  lack  of  a  predistortion  seems  to  magnify  SLM  noise  sensitivity,  we 
opted  to  test  the  complete  SLM  model  with  compensation.  The  results  appear  in  Table 
XIV. 


TABLE  XIV 

The  complete,  compensated  SLM  noise  model.t 


I.C. 

BEP 

time  to  convergence 
minimal} 

maximalft 

I 

1010 

1100 

2000 

II 

1400 

1800 

2200 

III 

1810 

1900 

2800 

IV 

1950 

1600 

2000 

ttrMed=20S464;  ■Vfccd=200493. 

-  0.0012S,  <72(i)  =  0.004,  ext.  ratio  =  215.4. 
tta^  =  0.006,  o2(i)  =  0.01,  ext.  ratio  =  31.62. 

D.  Hidden-layer  Redundancy 

When  more  hidden  units  than  are  needed  to  solve  a  problem  are  present,  the  back 
propagation  algorithm  often  converges  more  quickly.  For  the  2-D  comers  problem,  we 
have  found  that  back  propagation  chooses  four  hidden  units,  emphasizes  them,  and 
diminishes  the  others,  all  via  Wkj.  Figure  20  dramatically  illustrates  this  trend  in 
learning. 

It  is  possible,  too,  that  an  optical  architecture  with  hidden-layer  redundancy  will 
exhibit  greater  immunity  to  certain  types  of  noise,  and  perhaps  increased  sensitivity  to 
others.  In  any  case,  increasing  J  in  an  optical  architecture  is  certainly  easy,  and  can  be 
done  at  a  substantially  lower  cost  in  computation  time  than  for  a  similar  increase  on  a 
serial  electronic  computer. 
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J=8,  icseed=  635264 


cycle 


Figure  20.  Evolution  of  connection  strengths  from  the  hidden  layer  of  size  /  =  8, 
solving  a  problem  requiring  only  J  =  4.  Also  shown  is  a  contour  plot;  the 
weakly  connected  hidden  units  produce  the  dashed  decision  lines  (one 
outside  of  view). 


To  ascertain  such  changes  in  noise  sensitivity,  we  performed  a  set  of  simulations 
using  the  same  comers  problem,  but  with  J  =  8— twice  the  necessary  number  of  hidden 
units.  Some  preliminary  runs  using  straight  back  propagation  pointed  us  to  the  use  of  a 
new  learning  rate;  »;  =  1.  We  still  initialized  the  network  using  Gilbert’s  method,  with  a 
hill  steepness  of  4,  as  in  the  7  =  4  simulations.  Also,  we  chose  four  sets  of  initial 
conditions;  the  learning  curves  associated  with  these  are  shown  in  Figure  21.  Note  that, 
though  the  curves  appear  to  vary,  the  convergence  times  are  more  similar  than  they  were 
for  7  =  4.  This  is  probably  due  to  the  fact  that  Gilbert’s  method  makes  the  initial 
conditions  more  alike,  and  more  so  with  redundant  hidden  units.  The  times  to 
convergence  are  720,  890,  960,  and  920  cycles,  respectively.  We  ran  each  of  the  7  =  8 
simulations  until  N  =  2500. 
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1. 


The  SLMs 


a.  The  Effect  of  Malus’s  Law 

As  we  had  when  J  was  4,  we  disabled  predistortion  compensation  for  the  SLM 
nonlinearity,  replacing  it  with  a  biased  input  (see  Figure  12).  For  the  most  part,  the 
absence  of  a  bias  resulted  in  the  longest  convergence  times.  But,  as  in  the  7=4  case,  as 
the  bias  approaches  its  limit,  the  convergence  time  approaches  its  value  in  the 
compensated  case.  Table  XV  shows  the  data.  Figure  22  provides  for  rapid  comparison 
of  the  convergence  data  for  both  J  values. 


TABLE  XV 

Simulation  results  with  and  without  compensation  for  Malus’s  law,  for  J  =  8.t 


time  to  convergence 

I.c.t 

zero  0.05  0.1 

compensated  via  bias 

0.15  0.2  0.25  0.3  0.35  0.4 

0.45 

pure 

BEP 

b. 


Temporal  Noise 


The  simulations  incorporating  temporal  noise  generated  learning  curves  whose 
intermittent  spikes  in  noff  were  taller.  Therefore  we  modified  our  averaging  metric  to 
suppress  taller  spikes: 

xb=100:100:3000; 

max(xb.*<mean(reshape<noff(2:301),10,30))>0.3)) 

The  convergence  time  data  appear  in  Table  XVI.  Figure  23  shows  graphically 
the  results  for  both  J  values.  The  different  line  types  correspond  to  different  sets  of 
initial  conditions;  note  three  lines  per  condition  (for  the  three  training  seeds)  appear.  It 
is  difficult  to  deduce  any  significant  change  in  SLM  noise  sensitivity  with  J. 


TABLE  XVI 

Simulation  results  with  temporal  noise  in  the  SLM  driving  electonics,  for  7  =  8.  Note 
that,  except  for  the  case  of  0  noise,  values  are  rounded  to  the  nearest  100  by  our 

modified  averaging  algorithm. 


I.C. 

training 

seedt 

time  to  convergence  (noffln= 

:2400;10:2500) 

BEP 

noise 

0.00062S  0.0012B 

variance 

1  0.0025  0.005 

0.0087  0.015 

I 

a 

700 

700 

800 

900 

(0.5) 

(1.3) 

1 

b 

m 

700 

700 

800 

1100 

2200 

(1.7) 

1 

c 

■ 

700 

700 

800 

1100 

2100 

(1.1) 

II 

900 

900 

900 

1100 

2300 

(1.4) 

V. 

n 

900 

900 

1000 

1200 

(3.4) 

(4.5) 

c 

900 

900 

1000 

1100 

(0.5) 

(1.3) 

III 

960 

1000 

1000 

1000 

1300 

2100 

(1.3) 

b 

1000 

1000 

1100 

1300 

(0.5) 

(4.2) 

c 

1000 

1100 

1100 

1800 

2200 

(1.3) 

IV 

920 

900 

1000 

1100 

1200 

(0.8) 

(1.9) 

i  b 

1000 

1000 

1000 

1200 

2200 

(2.0) 

! 

c 

1000 

1000 

900 

1300 

2200 

(1.1) 

ftneed  values  for  a,  b,  and  e  are  492012,  398092,  and  832919,  respertively.  Due  to  the  increased  spike  sise,  the 
averaging  method  is  baaed  around  0.3  rather  than  0.2. 


time  to  convergence  time  to  convergence 


Figure  23.  Convergence  time  vs.  SLM  temporal  noise  variance  for  two  networks  with 
different  hidden  layer  sizes.  Three  traces  are  shown  per  initial  condition 
set— for  the  three  training  seeds. 
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c. 


Space-variant  Gain 


The  results  suggest  that  the  larger  network  is  more  immune  to  space-variant  gain 
than  is  the  smaller.  See  Table  XVII.  Gain  variances  up  to  0.05  did  not  prevent 
convergence,  whereas  in  the  7  =  4  case  (Table  III),  even  variances  of  0.025  occasionally 
did.  Figure  24  illustrates  the  significant  increase  in  allowable  gain  variance  that  occurs 
with  the  increased  hidden  layer  size. 


TABLE  XVII 

Simulation  results  with  nonuniformity  in  the  SLM  electro-optic  response,  for  7=8. 


time  to  convergence  (noff|n= 

:2400:10:2500) 

variance  of  element  response  multipliers 

I.C. 

seedt 

BEP 

0.003126  0.00625 

0.0125 

0.025 

0.05 

0.087 

I 

a 

720 

720 

720 

700 

710 

800 

(3) 

b 

720 

730 

730 

770 

880 

(3) 

c 

710 

690 

640 

600 

550 

730 

II 

a 

890 

870 

860 

910 

1220 

1880 

(2) 

b 

880 

900 

1160 

1350 

1390 

(3) 

c 

890 

880 

1390 

1920 

1560 

(0«-4) 

III 

a 

960 

960 

840 

740 

700 

680 

(0) 

b 

1 

i 

970 

920 

830 

770 

840 

2080 

c 

1020 

1050 

950 

880 

990 

810 

IV 

a 

920 

930 

1010 

1030 

940 

1160 

(3) 

b 

1020 

1010 

890 

770 

670 

1000 

c 

1 

950 

940 

920 

880 

1100 

1170 

fivseed  vsluei  for  a,  b,  and  c  are  673487,  998853,  and  893289,  respectively. 
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time  to  convergence  time  to  convergence 


d.  Finite  Extinction  Ratio 

As  our  results  show,  the  size  of  the  hidden  layer  seems  to  have  little  effect  on 
the  change  in  time  to  convergence  resulting  from  decreased  extinction  ratio  in  the 
polarizers.  Table  V  and  Table  XVIII  show  the  results  for  7  =  4  and  7=8,  respectively. 
Figure  25  shows  that  increasing  7  does  slightly  decrease  the  rate  of  change  in 
convergence  time  with  ^ — as  well  as  increase  the  probability  of  convergence  at  higher 
values. 


TABLE  XVni 

Simulation  results  with  finite  extinction  ratio  in  the  SLMs,  for  7  =  8. 


I.C. 

time  to  convergence 

oo 

100 

56.23 

extinction  ratio 
31.62  17.18  10 

5.623 

3.162 

1.778 

I 

720 

750 

770 

830 

920 

1130 

1680 

>2500 

[stuck] 

II 

890 

930 

960 

1010 

1130 

1340 

1940 

>2500 

[stuck] 

III 

960 

1000 

1030 

1090 

1200 

1460 

2170 

>2500 

[stuck] 

IV 

920 

960 

1000 

1060 

1180 

1450 

2090 

>2500t  [stuck] 

fit  is  uncertain  whether  this  trial  was  on  its  way  to  convergence. 
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2.  The  Optical  Imaging  System 

In  the  7  =  4  simulations  (Table  VI),  we  found  that  large  enough  values  of 
crosstalk  had  unpredictable  effects  on  convergence  time.  We  also  found  that,  above  6% 
the  convergence  was  effectively  rendered  improbable.  For  7=8,  the  effect  is  a  more 
predictable  increase  in  convergence  time  with  added  crosstalk,  as  can  be  seen  in  Table 
XIX.  Larger  values,  above  6%,  were  tolerable.  Figure  26  shows  graphically  the  results 
for  both  7  values. 


TABLE  XIX 

The  effects  of  crosstalk  upon  convergence,  for  7  =  8. 


I.C. 

time  to  convergence 

0% 

0.05% 

0.1% 

0.2% 

crosstalk 

0.4%  0.8% 

1.6% 

3.2% 

6.4% 

12.8% 

I 

720 

720 

720 

730 

740 

770 

830 

970 

1240 

[stuck] 

II 

890 

890 

900 

910 

910 

940 

1000 

1290 

1240 

[stuck] 

III 

960 

960 

980 

980 

990 

1000 

1030 

1130 

1410 

[stuck] 

IV 

920 

910 

910 

910 

910 

950 

980 

1130 

1330 

[stuck] 
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3. 


The  Detectors 


a.  PIN  Shot  Noise 

Table  XX  shows  the  imulation  results.  As  revealed  in  Figure  27,  there  is  little 
difference  from  the  7  =  4  case  in  the  allowable  SNRmax  range.  Interestingly,  though, 
unlike  in  the  7  =  4  case,  the  learning  curves  of  noff  are  remarkably  spike-free.  The 
averaging  metric  was  not  needed,  therefore,  so  that  the  values  in  the  Table  are  not  all 
divisible  by  100. 


TABLE  XX 

Shot  noise  in  the  PIN  detectors,  for  7  =  8. 


I.C. 

time  to  convergence. 

(noffln: 

=2400:10:2500) 

OO 

215.4 

100 

SNRmax 

46.42  21.54 

10 

4.462 

I 

720 

710 

720 

710 

810 

(1.1) 

(6.9) 

II 

890 

900 

900 

900 

890 

(1.3) 

(6.1) 

III 

960 

960 

960 

960 

1010 

(1.3) 

(7.1) 

IV 

1 

920 

920 

920 

920 

990 

(1.0) 

(5.8) 
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b. 


PIN  Thermal  Noise 


Table  XXI  shows  the  simulation  results.  Here  again,  little  difference  in  the 
allowable  SNRmax  range  is  created  by  having  four  additional  hidden  units.  This  is 
confirmed  in  Figure  28. 


TABLE  XXI 

Thermal  noise  in  the  PIN  detectors,  for  7=8.  Note  that,  except  for  the  case  of  oo, 
values  are  rounded  to  the  nearest  100  by  our  de-spiking  algorithm. 


I.C. 

time  to  convergence. 

(noff]„: 

=2400:10 

:2500) 

oo 

215.4 

100 

SNRmax 

46.42  21.54 

10 

4.462 

I 

720 

700 

700 

1300 

(2.2) 

(9.1) 

(11.0) 

II 

890 

900 

900 

1300 

(2.8) 

(9.3) 

(11.5) 

III 

960 

1000 

1000 

1300 

(1.5) 

(9.1) 

(11.6) 

IV 

920 

900 

1000 

1300 

(1.6) 

(9.2) 

(11.1) 
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time  to  convergence  time  to  convergence 


J=  8,  PIN  Thermal  Noise 


Figure  28.  Convergence  time  vs.  PIN  thermal  noise  for  two  networks  with  different 
hidden  layer  sizes. 
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c. 


CCD  Shot  Noise 


Table  XXII  shows  the  simulation  results.  Although  this  is  confirmed  by  only  one 
trial,  convergence  is  possible  for  an  SNRmax  as  low  as  4.462  for  7  =  8,  unlike  for  7  =  4. 
Figure  29  shows  the  graphical  comparison. 


TABLE  XXII 

Shot  noise  in  the  CCD  detectors,  for  7  =  8. 


time  to  convergence,  (noff|n=2400:io:250o) 


I.C. 

OO 

215.4 

100 

SNRmax 

46.42  21.54 

10 

4.462 

I 

720 

720 

720 

720 

750 

700 

1690 

n 

890 

890 

890 

890 

920 

910 

fstuckl 

in 

960 

960 

970 

990 

1000 

820 

(8.2) 

IV 

920 

920 

910 

890 

880 

610 

(3.5) 
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d.  CCD  Thermal  Noise 

Table  XXIII  shows  the  simulation  results.  Having  7  =  8  can  allow  an  SNRmax  as 
low  as  100.  Figure  30  shows  the  graphical  comparison. 


TABLE  XXIII 

Thermal  noise  in  the  CCD  detectors,  for  7=8. 


time  to  conver 

gence. 

(noffln= 

:2400;10:2B00) 

SNR  max 

I.C. 

oo 

215.4 

100 

46.42 

21.54 

10 

4.462 

I 

720 

710 

610 

(6.3) 

(11.3) 

(12.9) 

(12.2) 

II 

890 

920 

1620 

[stuck]  [stuck]  [stuck]  [stuck] 

III 

960 

1100 

950 

(9.1) 

(11.6) 

(12.9) 

(12.2) 

IV 

920 

800 

600 

(6.3) 

(11.8) 

(12.9) 

(12.2) 
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I - 1 - L _ ^ _ i _ I 

0  0.002  0.004  0.006  0.008  0.01 


1/SNRmax 


J=8,  CCD  Thermal  Noise 


Figure  30.  Convergence  time  vs.  CCD  thermal  noise  for  two  networks  with  different 
hidden  layer  sizes. 
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E.  Combinations  of  Key  Effects  (J  =  8) 


We  performed  simulations  similar  to  those  combinations  we  had  run  with  J  <=  A. 


1.  Temporal  Noise  and  Malus’s  Law  in  the  Weights 


As  was  the  case  for  temporal  noise  by  itself,  the  learning  curve  spikes  were  large 
enough  to  require  using  0.3  in  the  averaging  metric  in  place  of  0.2.  As  in  the  J  =  4 
case,  a  sweet  spot  is  manifest  for  each  of  the  traces..  The  results  are  given  in  Table 
XXIV.  Figure  31  shows  the  sweet  spots  graphically. 


TABLE  XXIV 

The  combining  of  the  Malus's  Law  nonlinearity  with  SLM  temporal  noise,t  for  7  =  8. 


(time  to  convergence,  (noffln=2400:i0:2soo) 


variance 

0 

0.1 

bias 

0.2 

0.3 

0.4 

0.005 

2000 

1400 

2300 

2400 

2500 

0.0025 

1700 

1000 

900 

1100 

2400 

0.00125 

1700 

1000 

900 

800 

1100 

tl.C.  1,  tneed=205464. 
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J=8,  Nonlin.  &  Temporal  SLM  Noise 


Figure  31.  The  existence  of  a  “sweet  spot”  in  bias,  whereat  the  bias’s  benefit  and 
harm  are  in  balance.  The  solid  curve  represents  a  noise  variance  of  0.005; 
the  dashed  curve,  0.0025;  the  dotted  curve,  0.00125. 


2.  Crosstalk  and  Shot  Noise 


As  was  the  case  for  J  =  4,  (See  Table  XII.)  crosstalk  and  shot  noise  effects 
appear  to  accumulate.  This  time,  however,  spikes  do  appear,  and  so  the  averaging 
method,  with  0.2,  was  used.  The  results  are  shown  in  Table  XXV. 


TABLE  XXV 

The  combining  of  crosstalk  and  shot  noiset,  for  J  «  8. 


I.C. 

BEP 

time  to  convergence 
minimalt 

maximaltt 

I 

720 

800 

2500 

II 

890 

900 

1300 

III 

960 

1000 

1300 

IV 

920 

1000 

1000 

ftnecd=606842 

^0.2%  crotftalk,  SNRmax(PIN  ihot)  =  100,  SNRmax(CCD  thot)  =  46.42 
tt3.2%  croMtalk,  SNRmax(PlN  shot)  =  21.S4,  SNRniax(CCD  shot)  =  10. 
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3.  The  Complete  Detector  Noise  Models 

The  results  shown  in  Table  XXVI  suggest  that  the  degree  to  which  the  different 
types  of  noise  detrimentally  interact  is  not  significantly  reduced  by  hidden  layer 
redundancy.  The  equivalent  7  =  4  results  were  presented  in  Table  XIII. 


TABLE  XXVI 

The  complete  detector  noise  modelsf,  for  7=8. 


I.C. 

BEP 

time  to  convergence 
minimall 

maximalft 

I 

720 

760 

[stuck] 

II 

890 

960 

[stuck] 

III 

960 

960 

[stuck] 

IV 

920 

880 

[stuck] 

ttneed=606842 

tSNRniax(PIN  ihot)  =  100,  SNRmax(CCD  ihot)  =  46.42,  SNRm«v(PIN  thermal)  =  100,  SNRniax(CCD 
thermal)  =  464.2. 

ttSNRmax(PIN  ihot)  =  21.54,  SNRmax(PIN  thermal)  =  46.42,  SNRm.v(CCD  ahot)  =  10,  SNRm«v(CCD 
thermal)  =  100. 
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4. 


The  Complete,  Compensated  SLM  Noise  Model 


The  data  are  shown  in  Table  XXVII.  The  “minimal”  effects  do  not  seem  to 
form  a  lethal  combination.  However,  the  learning  curves  for  the  “maximal”  conditions 
took  on  an  extremely  spiked  appearance.  While  in  most  cases,  a  few  spurious  ikes 
after  “convergence”  are  forgiven,  in  this  case  we  judged  the  amount  to  be  unacceptable. 
Figure  32  shows  one  of  the  maximal  learning  curves.  Plainly,  the  excessive  spiked 
appearance  makes  the  use  of  an  averaging  metric  a  mere  exercise. 


TABLE  XXVII 

The  complete,  compensated  SLM  noise  model,  for  J  =  8.t 


I.C. 

BEP 

time  to  convergence 
minimall 

maximalft 

I 

720 

700 

[spiked] 

II 

890 

900 

[spiked] 

III 

960 

940 

[spiked] 

IV 

920 

1000 

[spiked] 

ttr«eed=206464:  •vi*«d=87S487. 

=  0.00125.  Cr2(i)  =  0.004,  «it.  =  215.4. 
tt<72  =  0.006,  =  0.01,  «xt.  ratio  =  17.78. 


This  problem  did  not  occur  for  the  /  *=  4  case.  It  seems  that  one  price  paid  for 
hidden  layer  redundancy  is  in  the  sensitivity  of  the  system  after  a  convergence  of  sorts 
has  already  occured. 
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=  8, 1.  C  I,  full  SLM  model,  ma-dmal 


Learning  curve  for  simulation  with  the  complete  SLM  noise  model,  with 
relatively  large  parameters. 


F.  The  Entire  Architecture 

We  performed  several  sets  of  simulations  in  which  all  imperfections  were 
assumed  to  exist.  To  allow  for  the  likely  possibility  that  predistortion  compensation  for 
the  SLM  nonlinearity  would  be  used,  we  simulated  the  network  with  predistortion 
enabled  and  disabled,  using  a  bias  of  /  =  0.15  in  the  latter  case.  Naturally,  the  values  of 
all  the  other  parameters  must  be  chosen  with  care. 

Let  us  begin  with  the  SLMs.  With  the  current  technology  in  Pockel’s-cell 
devices,  keeping  temporal  noise  in  the  driving  electronics  below  0.2%  is  achievable. 
Space-variant  gain  is  a  function  of  fabrication  uniformity,  primarily  in  thickness.  This 
can  be  controlled  typically  to  within  a  percent.  Extinction  ratios  on  dichroic  sheet 
polarizers  are  typically  10^;  based  on  our  results  concerning  extinction  ratio,  and  given 
imperfect  alignment,  we  allowed  extinction  ratios  to  be  only  250. 

Crosstalk  is  an  easily-prevented  occurence;  its  prevention  simply  requires 
increased  inter-element  spacing.  We  assumed  that  0.2%  would  always  be  present. 

Our  PIN  detector  noise  values  were  obtained  using  an  analysis  like  that  at  the 
end  of  Section  IV.  B.  3.  For  these  simulations,  we  used  SNRniax(PIN  shct'  =  1500,  and 
SNRn,ax(PIN  thermal)  =  250. 

For  the  CCDs,  based  on  currently  available  technology,  we  set  SNRmaxCCCD 
shot)  at  500,  and  SNRmax(CCD  thermal)  at  1000. 

For  these  trials  we  allowed  the  simulations  to  run  longer  than  we  had  for  the 
previous  ones.  That  is,  for  7  =  4,  we  set  N  =  4000;  and  for  7  =  8,  we  set  N  =  3500. 
'i'ables  XXVIII  and  XXIX  show  the  results  for  four  trial  sets.  Tables  XXX  and  XXI 
repeat  the  same  data,  except  that  each  trial’s  simulation  time  is  shown  divided  by  the 
corresponding  “no  effects”  time.  This  makes  interpreting  the  data  easier. 

The  “trial  1”  data  for  both  7  values  show,  for  the  most  part,  an  increase  in 
convergence  time.  In  trial  2,  we  kept  ail  parameters  the  same  except  for  temporal  noise, 
whose  variance  we  increased  to  0.005.  We  felt  that  this  increased  convergence  time  by 
what  might  might  be  considered  excessive;  in  some  cases,  the  convergence  time  doubled, 
for  7  =  4.  So  we  backed  it  down. 
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TABLE  XXVIII 

Simulation  of  the  entire  architecture,  for  j  =  4.t 


I.C. 

nonlinearity 

1  time  to  convergence 

no  effects 

trial  1 

trial  2 

trial  3 

trial  4 

I 

compensated 

1010 

1300 

2100 

1400 

[stuck] 

biased 

1160 

1500 

3300 

1600 

[stuck] 

II 

compensated 

1400 

2900 

2200 

2300 

2200 

biased 

1520 

1800 

2800 

2000 

2300 

III 

compensated 

1810 

3300 

2500 

2900 

1500 

biased 

1720 

1900 

3600 

1900 

4000 

IV 

compensated 

1950 

2000 

2500 

1700 

2900 

biased 

1800 

1800 

3100 

2100 

1900 

SLM  temp. 

0.002 

0.005 

0.002 

0.002 

s.-v.  gain 

0.01 

0.01 

0.01 

0.01 

ext.  ratio 

250 

250 

100 

30 

crosstalk 

0.002 

0.002 

0.005 

0.01 

PIN  shot 

1500 

1500 

1500 

1500 

PIN  thermal 

250 

250 

250 

250 

CCD  shot 

500 

500 

500 

250 

CCD  thermal 

1000 

1000 

1000 

500 

ttrMcd=490001;  >vfeed=6d8S92. 


TABLE  XXIX 

Simulation  of  the  entire  architecture,  for  J  =  S.f 


I.C. 

nonlinearity 

time  to  convereence 

no  effects 

trial  1 

trial  2 

trial  3 

trial  4 

I 

compensated 

720 

800 

1000 

800 

1200 

biased 

830 

900 

1250 

1000 

1100 

II 

compensated 

890 

1000 

1200 

1000 

1100 

biased 

1140 

1100 

1250 

1000 

1000 

III 

compensated 

960 

900 

1300 

900 

1200 

biased 

900 

800 

1250 

900 

1100 

IV 

compensated 

920 

1200 

1300 

1000 

1200 

biased 

840 

1300 

1250 

1000 

2100 

SLM  temp. 

0.002 

0.005 

0.002 

0.002 

s.-v.  gain 

0.01 

0.01 

0.01 

0.01 

ext.  ratio 

250 

250 

100 

30 

crosstalk 

0.002 

0.002 

0.005 

0.01 

PIN  shot 

1500 

1500 

1500 

1500 

PIN  thermal 

250 

250 

250 

250 

CCD  shot 

500 

500 

500 

250 

CCD  thermal 

1000 

1000 

1000 

500 

ttrfMd=492012;  iviMd=S7S487. 
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TABLE  XXX 

Simulation  of  the  entire  architecture,  for  j  -  4.t 


I.C. 

nonlinearity 

time  to  convergence  divided  bv  that  for  no  effects 

no  effects 

trial  1 

trial  2 

trial  3 

trial  4 

I 

compensated 

1 

1.29 

2.08 

1.39 

[stuck] 

biased 

1 

1.29 

2.84 

1.64 

[stuck] 

II 

compensated 

1 

2.07 

1.57 

1.64 

1.57 

biased 

1 

1.18 

1.84 

1.32 

1.51 

III 

compensated 

1 

1.82 

1.38 

1.60 

0.83 

biased 

1 

1.10 

2.09 

1.10 

2.33 

IV 

compensated 

1 

1.03 

1.28 

0.87 

1.49 

biased 

1 

1.00 

1.72 

1.17 

1.06 

SLM  temp. 

0.002 

0.005 

0.002 

0.002 

s.-v.  gain 

O.OI 

0.01 

0.01 

0.01 

ext.  ratio 

250 

250 

100 

30 

crosstalk 

0.002 

0.002 

0.005 

0.01 

PIN  shot  ‘ 

1500 

1500 

1500 

1500 

PIN  thermal 

250 

250 

250 

250 

CCD  shot 

500 

500 

500 

250 

CCD  thermal 

1000 

1000 

1000 

500 

ttneed=490001;  ■v«eed=698392. 

TABLE  XXXI 

Simulation  of  the  entire  architecture,  for  J  =  S.f 


I.C. 

nonlinearity 

time  to  convergence  divided  bv  that  for  no  effects 

no  effects 

trial  1 

trial  2 

trial  3 

trial  4 

I 

compensated 

1 

l.Il 

1.39 

1.11 

1.67 

biased 

1 

1.08 

1.51 

1.20 

1.33 

II 

compensated 

1 

1.12 

1.35 

1.12 

1.24 

biased 

1 

0.96 

1.10 

0.88 

0.88 

III 

compensated 

1 

0.94 

1.35 

0.94 

1.25 

biased 

1 

0.89 

1.39 

1.00 

1.22 

IV 

compensated 

1 

1.30 

1.41 

1.09 

1.30 

biased 

1 

1.55 

1.49 

1.19 

2.50 

SLM  temp. 

0.002 

0.005 

0.002 

0.002 

s.-v.  gain 

0.01 

0.01 

0.01 

0.01 

ext.  ratio 

250 

250 

100 

30 

crosstalk 

0.002 

0.002 

0.005 

0.01 

PIN  shot 

1500 

1500 

1500 

1500 

PIN  thermal 

250 

250 

250 

250 

CCD  shot 

500 

500 

500 

250 

CCD  thermal 

1000 

1000 

1000 

500 

ttneed=492012;  ■viced=S73487. 
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In  trial  3,  we  increased  crosstalk  and  worsened  the  extinction  ratio.  For  the  most 
part,  the  results  are  quite  similar  to  those  from  trial  1.  Evidently,  these  changes, 
particularly  the  one  in  crosstalk,  play  a  less  significant  part  than  a  similar  change  in 
SLM  temporal  noise. 

In  trial  4,  we  further  worsened  extinction  ratio  and  crosstalk,  and  significantly 
reduced  the  CCD  performance.  Again,  these  conditions  seemed  to  have  a  similar  effect 
to  increasing  SLM  temporal  noise,  but  this  time  some  of  the  7  =  4  runs  failed  to 
converge. 

In  contusion,  an  optical  back  propagation  network  solving  a  problem  of  the  size 
of  2-D  corners  will  perform  best  with  extra  hidden  units,  and  will  require  conditions  at 
least  as  good  as  those  set  in  trial  2  or  3. 
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V.  Exemplar- based  Model 


The  B£P  model  for  a  neural  net  classifier  separates  the  input  patterns  by  a  series 
of  hyperplanes.  As  we  illustrated  in  Section  III.  B.  2.,  each  of  the  hyperplanes  is 
associated  with  a  hidden  processing  element  and  the  orientation  and  location  of  the 
hyperplane  is  completely  specified  by  the  weight  vector  associated  with  that  processing 
element.  One  or  more  layers  of  hidden  processing  elements  are  needed  to  form  an 
arbitrary  decision  boundary  via  intersections  of  half  spaces.^® 

Exemplar-based  classifiers^®  use  their  hidden  processing  elements  somewhat 
differently.  The  weight  vectors  associated  with  the  processing  elements  in  the  first  layer 
simply  represent  the  input  patterns  (exemplars).  The  subsequent  layers  then  group 
outputs  from  the  first  layer  processing  elements  and  make  a  classification  decision.  The 
subsequent  processing  differs  between  k-nearest  neighbor  classifiers,^^’^^  Restricted 
Coulomb  Energy  classifiers^^*^^  or  Radial  Basis  Function  classifiers.^®’*®  The  common 
characteristic  of  these  models  is  that  they  have  high  storage  requirements  (proportional 
to  the  exemplars  in  the  training  set)  but  short  training  times.  The  error  signals  in  the 
training  cycle  are  used  mainly  to  adjust  the  weights  in  the  subsequent  layers  or  some 
simple  parameters  (e.g.,  the  size  of  the  basin  of  attraction)  associated  with  the  processing 
elements  in  the  first  layer.  Below  are  some  descriptions  of  those  exemplar-based 
networks  that  we  have  studied  and  simulated. 

A,  The  Nearest-neighbor  Network 

The  simplest  exemplar-based  approach  is  to  assign  to  every  training  pair  an 
exemplar  center.  The  task  of  the  network  would  then  be  to  determine  which  of  these 
“neighbors”  is  the  nearest  to  the  given  input.  It  then  outputs  the  class  of  the  nearest 
neighbor.  There  is  no  training  step,  of  course,  other  than  the  initialization. 

The  distance  computation  is  based  around  a  vector-matrix  multiplication,  not 
unlike  a  hyperplane-based  network.  Let  an  /-element  input  vector  Oi  be  presented,  and 
the  position  vector  of  the  yth  hidden  unit  be  Cji.  Then  the  distance  is 

|Oi-Cji|2  =  N*  +  ICjil*  -  2  Oi  •  Cji 
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and  J  such  distances  exist.  However,  since  |oi|^  will  be  the  same  for  all  hidden  units,  all 
that  is  needed  for  comparison  is  ICjil^  -  2  o\'  Cji.  After  computing  these  J  comparative 
distance  terms,  the  network  passes  only  the  output  value  associated  with  that  hidden  unit 
whose  distance  equals  the  minimum. 

Figure  33  shows  how  a  nearest-neighbor  network  allocates  the  decision  space  for 
the  2-D  skewed  corners  problem.  Also  shown  is  the  network’s  solution  when  nine  of 
the  boundary  data  points  are  missing.  It  is  interesting  to  note  that  back  propagation 
(with  J  =  4)  was  unable  to  solve  this  sparse  version  of  the  corners  problem.  While  the 
network’s  generalization  is  good,  the  efficiency  of  hidden  unit  allocation  is  obviously 
low.  One  solution  might  be  to  use  only  selected  inputs  as  exemplars. 


Figure  33.  The  decision  regions  pro-iuced  by  a  nearest-neighbor  network  for  the 
2-D  skewed  corners  problem. 

B.  The  Nestor  Learning  System 

The  Nestor  Learning  System,  based  on  Restricted  Coulomb  Energy  methods^*-*^, 
is  one  system  that  we  have  studied.  This  method  dynamically  commits  new  hidden  units 
(called  prototype  units)  as  training  progresses,  and  modifies  existing  ones,  until  all 
training  pairs  are  correctly  classified. 

The  neighborhoods  in  this  system  are  hypersherical  in  shape;  their  centers  are 
themselves  co-located  with  selected  training  inputs.  Each  neighborhood  is  tagged  with 
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one  of  the  possible  output  classes  of  the  network.  As  each  training  pair  is  presented,  the 
algorithm 

♦  reduces  the  size  of  those  neighborhoods  which  overlap  and  conflict  with  that 
pair,  and 

♦  commits  a  new  prototype  unit,  centered  at  that  pair,  if  the  pair  is  unclassified 
(that  is,  if  it  is  located  inside  no  neighborhoods). 

Naturally,  the  order  of  presentation  of  the  training  data  strongly  determines 
where  the  neighborhoods  will  end  up;  this  can  give  rise  to  inefficient  configurations, 
where  a  large  portion  of  the  input  space  is  unidentified  or  confused.  One  advantage  of 
this  method  is  that  very  few  cycles  of  training  data  presentation  are  needed  before  the 
network  is  solved.  It  is  important  to  note  that  what  we  have  here  described,  somewhat 
tersely,  is  what  Nestor  Corporation  calls  a  single-layer  system  (despite  the  multilayer 
nature) — the  version  that  is  marketed  by  Nestor  in  fact  uses  a  sophisticated  multilayer 
configuration. 

We  wrote  a  Matlab  simulation  program  for  this  system.  The  program  is  given  in 
the  Appendix,  Section  C.  Figure  34  shows  the  decision  space  allocation  of  a  solved 
network  for  the  2-D  skewed  corners  problem  with  /  =  15.  In  this  case,  the  presentation 
order  was  left-to-right,  beginning  with  the  bottom  row  on  the  graph.  While  only  three 
epochs  were  needed  to  solve  the  problem,  the  poor  generalization  is  self-evident. 
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identified 

unidentified 

confused 


Figure  34.  The  decision  regions  produced  by  a  Nestor  Learning  System  single-layer 
network.  The  network  produces  identified,  unidentified,  and  confused 
reponses  to  inputs  in  the  shown  regions. 


C.  Modified  BEP 


As  mentioned  in  Figure  1,  we  at  BDM  have  developed  an  algorithm  which  uses  a 
gradient  descent  to  effect  learning  in  a  radial  basis  function  network.  Recall  Section  III. 
B.  1.,  in  which  the  forward  pass  equations  and  the  delta  rule  were  used  to  derive 
equations  for  the  weight  and  bias  updates.  The  same  type  of  derivation  can  be  used  for 
a  network  with  different  forward  pass  equations.  In  a  radial  basis  function  network,  the 
hidden  units  produce  outputs  which  exponentially  fall  off  with  increasing  input  distance 
from  a  centroid.  The  rate  of  falloff  varies  with  direction;  therefore,  surfaces  of  constant 
output  are  ellipsoids  rather  than  simply  hypersheres.  The  output  units  form  hyperplanes, 
just  as  in  traditional  BEP.  Thus,  the  forward  pass  equations  are: 

Forward  Pass 

1.  Netj  =  X;  (oi-Cji)2/(7ji2  Oj  =  exp(-Netj) 

i 

2.  Ne/k  =  E  Oi  IFkj  Ok  =  1/(1  +  exp(-Ne/k)) 

j 
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where  Cji  represents  the  locations  of  the  centers,  ITkj  the  weights  in  the  output  layer, 
and  oji  is  a  constant  used  to  shape  the  ellipsoids. 

Recall  the  delta  rule: 

AW  «  -dE/dW],j  where  f  =  E  E  (^k-Ork)* 

r  k 

Since  the  equations  for  Netk  and  Ok  are  identical  to  those  in  conventional  BEP,  so  will  be 
the  equations  for  6k  and  AlTkj.  For  the  hidden  layer  the  delta  rule  calls  for  dEldCyu 
which  can  be  computed  using  the  chain  rule: 

dE/aCji  =  (dE/dNetj)  x  {dNetj/aCn). 

We  can  show  that 

aNeti/aCji  ~  -2  (oi-CjO  /  ay?. 

This  one  term  is  the  only  term  in  the  sum  over  all  i  that  is  nonconstant  with  respect  to 
Cji.  The  other  link  we  will  continue  to  call  -6j: 

aE/aNeti  =  -5j  =  idE/aoj)  X  (aoj/aNetj) 

of  which  ao'i/aNet]  is  just  -oj.  The  first  link  is 

5£/50j  =  a/aOi  {  E  (a£/aNe/k)(5Ne/k/90j)  } 

k 

=  -E  «k  a/aoj{E  =  -E  «k  w^^ 

k  k  k 


Therefore, 


and 


*  -  E  ^k  IFkj  Oj 

k 
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ACji  —  -2  A  5j  (Oi-Cji)  /  Oji^. 


where  A  is  a  learning  rate  constant,  similar  to  t). 

At  this  point  the  hidden  ellipsoid  sizes  oji  remain  to  be  computed.  Beginning 
with  the  chain  rule. 


dE/doji  j  j  ji  dE/da  =  (dE/dNet )  x  (d 


We  can  show  that 


dNetj/doji  =  -2  (oi-CjO^  /  oji*. 

The  other  link  appeared  in  the  hidden  center  discussion  and  is  the  same  -5j.  The  delta 
rule  then  rewritten; 

Aoji  oc  -dEldofi  =  -2  ^  6j  (Oi-Cji)^  / 

where  C  is  yet  a  third  learning  rate  variable.  In  summary,  the  delta  rule  yields  up  the 
following  equations  for  updating  the  output  weights,  hidden  centers,  and  hidden  sizes: 

Gradient  Descent  Equations 

3.  Sk  =  (I -Ok)  Ok-Ok) 

4-  =  -  L 

k 

5.  AlFkj  =  ri  Oj  Sk 

6.  ACji  =  -2  A  fij  (oi-Cji)  /  Oji^ 

7.  Aaji  =  -2  C  5j  (oi-Cji)2  /  Oji^ 

In  the  above  equations,  the  learning  rates  are  given  as  three  different  variables— 17,  A, 
and  because  they  appear  in  differing  types  of  equations.  However,  there  is  no  reason 
why  hyperplane-based  back  propagation  could  not  also  possess  different  learning  rates 
for  the  different  layers. 
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We  coded  modified  back  propagation  in  the  Matlab  modules  given  in  the  Section 
D  of  Appendix.  Figure  35  shows  the  learning  curve  of  one  trial  of  the  system,  where  8 
hidden  units  are  used;  Figure  36  shows  the  corresponding  solution  space.  Notice  that 
only  180  iterations  were  required. 


Modified  BEP  learning  curve 


Figure  35.  A  learning  curve  produced  by  modified  BEP  on  the  2-D  skewed  comers 
problem,  with  7  =  8. 

The  solution  initially  surprised  us;  rather  than  drawing  nearly  circular  ellipses 
about  the  corner  regions,  the  network  discovered  the  horizontal  and  vertical  rows  of  the 
other  class.  Also,  the  network  gives  no  regard  to  the  exact  locations  of  the  corner 
training  data— outside  the  narrow  ellipses.  (By  comparison,  note  Figure  8;  the  0.5  output 
contour  curve  generated  by  back  propagation  is  located  halfway  between  the  locations  of 
the  two  classes,  at  all  corners.)  That  the  generated  solution  depends  on  only  some  of  the 
data  suggests  that  the  generalization  of  this  network  is  poor. 

Note  that  the  ellipsoidal  neighborhoods  are  restrained  to  having  major  axes  along 
one  of  the  input  space  axes.  However,  one  could  begin  with  a  more  elaborate  set  of 
forward  pass  equations,  based  on  ellipsoidal  regions  which  can  be  tilted.  It  is  interesting 
to  speculate  what  solutions  such  a  network  would  generate. 
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VI.  DISCUSSION  AND  CONCLUSIONS 


Our  analysis  of  an  optical  back  propagation  architecture  was  preceded  by  a  study 
of  the  algorithm  itself  and  a  technology-independent  analysis.  In  the  process,  we 
discovered  an  important  and  usually-overlooked  initialization  scheme  developed  by 
Sheldon  Gilbert.  This  technique,  as  well  as  other  enhancements  to  the  original  back 
propagation,  shows  promise  in  significantly  reducing  the  number  of  cycles  to 
convergence.  We  have  also  studied  neighborhood-based  multilayer  feedforward  nets.  In 
particular,  we  proposed  a  novel  approach  to  learning  in  one  such  network  using  radial 
basis  functions.  This,  too,  shows  room  for  enhancement,  though  as  it  is  convergence 
time  is  already  reduced  compared  to  back  propagation. 

The  technology-dependent  analysis  of  optical  neural  networks  has  yielded  a 
wealth  of  useful  information.  Our  requirement  of  full  utilization  of  the  3-D  nature  of 
optical  communications  for  all  operations  that  warrant  it  has  resulted  in  the  design  of  a 
promising  arhcitecture.  We  have  analyzed  this  architecture  in  a  part-by-part  fashion, 
developing  a  rigorous  model  for  the  SLMs,  imaging  system,  and  detection  systems. 

While  excessive  noise  in  a  given  element  may  prevent  convergence  our 
simulations  have  revealed  relatively  little  “noise  accumulation.”  That  a  learning  curve 
can  exhibit  both  a  noisy  appearance  and  convergence  is  evidence  of  the  robustness  of  the 
back  propagation  algorithm.  However,  our  simulations  also  show  us  that  the  presence  of 
an  uncorrected  nonlinearity  due  to  Malus’s  law  will  significantly  degrade  performance. 
Introducing  a  bias  is  a  simple  remedy  to  this  problem;  nevertheless,  it  will  also  be 
necessary  to  insure  that  noise  levels  are  kept  low  for  this  remedy  to  be  useful.  This  may 
not  be  achievable  with  present  SLM  analog  driving  technology;  predistortion 
compensation  would  then  be  required. 

We  have  also  observed  that  performance  and  noise  immunity  are  both  greatly 
increased  by  the  use  of  extra  hidden  units.  Further,  extra  hidden  units  are  not  only  easy 
to  add  in  an  optical  architecture;  it  is  also  likely  they  will  already  exist.  That  is,  the 
number  of  hidden  units  for  a  given  complex  problem  is  usually  not  known  a  priori,  so 
that  extra  hidden  units  will  have  to  be  made  available. 

In  general,  we  have  shown  that  an  optical  back  propagation  architecture  is 
capable  of  full  operation  with  realistic  hardware.  This  is  especially  true  for  the  case  of 
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hidden  layer  redundancy,  and  where  the  SLM  nonlinearity  has  been  predistortion- 
compensated. 
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IX.  CONFERENCE  PRESENTATIONS 


James  J.  Levy,  Ravindra  A.  Athale,  and  Michael  W.  Haney  presented  a  paper  at 
the  Annual  Meeting  of  the  Optical  Society  of  America,  held  November  4-9,  1990  in 
Boston.  Below  are  given  the  abstract  and  summary,  which  appeared  in  the  advance 
program  and  technical  digest,  respectively. 

Abstract 

Computer  simulations  reveal  the  effects  of  noise  in  optical  weight  matrix 
elements,  updated  by  back  propagation,  upon  the  learning  curve  characteristics. 

Summary 

The  back  propagation  algorithm^  has  become  increasingly  popular  in  the  neural 
net  research  community.  Various  optical  implementations  have  been  proposed,  with  the 
hope  of  increased  performance  via  the  parallelism  of  optics.  Realistic  models  of  opto¬ 
electronic  implementations  must  include  the  effects  of  noise.  While  noise  often 
“anneals,”  increasing  the  convergence  rate,  excessive  noise  can  effectively  transform  any 
updating  algorithm  into  a  random  search  among  weight  configurations.  For  the  purpose 
of  evaluating  the  effects  of  component  noise  on  the  performance  of  the  back 
propagation  algorithm,  we  have  developed  a  simulation  program  which  allows  insertion 
of  noise  terms  wherever  appropriate.  The  program  introduces  noise  processes  into  the 
weight  updating  process  during  the  learning  phase.  The  learning  curve  is  defined  as  the 
mean-square  error  vs.  iteration  number;  our  analysis  emphasizes  examination  of  learning 
curves  rather  than  simply  probability  of  convergence.  In  this  paper  we  describe  the 
behavior  and  noise  tolerance  of  back  propagation  as  related  to  hidden  layer  size,  learning 
rate,  and  initial  conditions. 


1.  D.  E.  Rumelhart,  G.  E.  Hinton,  and  R.  J.  Williams,  “Learning  internal 
representations  by  error  propagation,”  in  Parallel  Distributed  Processing,  vol.  1,  D.  E. 
Rumelhart  and  J.  L.  McClelland,  Eds.  Cambridge,  MA;  M.I.T.  Press,  1986,  pp.  318- 
362. 


94 


APPENDIX:  PROGRAM  LISTINGS 


Below  are  given  the  input  and  output  .mat  files  representing  the  2-D  skewed 
comers  problem. 


bDin.mat 

boout.mat 
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0.325 

0 

0.1 

0.425 
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0.825 
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1 

0 

0.1 

0 

0.225 

0.1 

0.325 
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0 
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0.325 

1 

0.1 

0.425 

1 

0.9 

0.825 

1 

0.1 

1 

1 
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A.  Straight  Back  Propagation 


In  choosing  a  problem  for  simulation,  we  look  for  one  which  is  not  prone  to  local 
minima.  Having  chosen  the  problem,  we  also  fine  tune  the  learning  rate  and  the 
parameters  to  be  used  by  the  Gilbert  initialization  process.  We  are  not  unaware  that  in 
the  real  world,  this  luxury  is  usually  not  available;  nevertheless,  we  are  evaluating  not 
algorithms,  but  architectures.  Therefore  we  are  justified  in  fine  tuning  the  solution 
process  and  thereby  making  the  simulations  run  more  easily. 


This  procedure  requires  the  performance  of  many  runs  quickly.  Even  with  all 
the  imperfections  disabled,  the  full-blown  simulation  uses  sign  encoding,  thereby  more 
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than  doubling  the  computation.  We  therefore  used  a  straight  back  propagation  series, 
the  b  series,  for  quick  evaluations  as  these.  The  b  series  consists  of  “bframe.m,”  which 
prompts  for  the  various  parameters;  “binit.m,”  which  loads  the  data  and  sets  up  the 
initial  state  of  the  network;  and  “bctrain.m,”  which  effects  the  training  procedure. 


Any  function  calls  which  are  not  part  of  PC-Matlab  are  listed  in  this  Appendix. 


bframe.ni 
X  bframe  script 

dispCThis  lets  you  run  binit  and  btrain  by  having') 
dispCyou  put  in  all  the  necessary  variables  beforehand.') 

J=input( 'Enter  J:  '); 
icseed=input( 'Enter  icseed:  '); 

sniartinit=input('Enter  1  to  initialize  by  method  of  Gilbert:  '); 
if  smart  ini t==1 

steepness^ input ('Enter  the  initial  hidden  unit  hill  steepness:  '); 
spread=1; 
else 

spread=input( 'Enter  the  initial  condition  spread:  '); 
end 

etas input ('Enter  eta:  '); 

N=input( 'Enter  N:  '); 

DNs input ('Enter  DN:  '); 

fflsetolsinput( 'Enter  the  mse  below  which  training  stops:  '); 

8sinput( 'Enter  the  upper  (and  -lower)  bound:  '); 

binit.m 

X  binit  script 
load  bpin 
load  bpout 
I=ncols(bpin) 

Ksncols(bpout) 

R=nrows( bpout) 
if  exist('keepUji') 

Uj iskeepUj i ;  THETA j=keepTHETAj ; 

WkjskeepWkj;  THETAkskeepTHETAk; 
else 

X  Initialize  the  weights  and  biases 

X  random  nunbers  about  zero  center 

rand( 'uniform' ) 

rand( ' seed ' , i cseed) 

randd.S);  X  to  exercise  generator 

Wji=(rand(I,J)-1  ./2)*spread; 

THETAj=(rand(1,J)-1  ./2)*spread; 

Wki=(rand(J,IC)-1  ./2)*spread; 

THETAkx(rand(1,K)-1  ./2)*8pread; 
if  smart init*=1 
X  the  Wji 
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X  nunerator  is  predat.  magnitude  of  weights  vector 
X  which's  normal  to  dividing  hyperplane  (line  for  1=2) 
x=steepness .  /sqr  t  ( sun(U  j  i .  ''2 ) ) ; 

Uj i=(ones( I , 1 )*x) .*Wj i ; 

X  the  THETAj 
THETA  j»-0.5*siin(Wj  i ); 

X  the  Ukj 

Wkj=(1/J)*ones(J,K); 

end 

keepWji=Uji,  keepTHETAj=THETAj 
keepWkj=Wkj,  keepTHETAk=THETAk 
end 
n=0; 

X  MSE  CALCULATION 
X  Forward  Pass 

NETj=tbpin.ones(R,1)J*[Uji;THETAi]; 

0j=1  ./(1+exp(-NETj)); 

METk=  COi ,  oneslR ,  1 )]  *  tUk  j ;  THETAkl ; 

0k=1  ./(1+exp<-NETk)); 

IdetmseC  1  )=siin(meanc(  (Ok-  bpout ) .  ''2) ) ; 

X  literally,  less  detailed  mse 
X  this  first  one's  not  an  average 
nof  f  ( 1)=stjn(  sum(  (  abs(Ok  -  bpout  )>0 . 4  )  )  )  ; 
noff2(1)=sun(sun((abs(0k-bpout)>0.2})); 
clear  ans  x  X  temporary  variables 

bctrain.m 

X  bctrain  script 

X  Accunulates  weight  change  over  epoch,  THEN  updates... 
stop=n+N; 
while  fKstop 
n=m-1 

X  Error  back  propagation 
0k=(bpout-0k).*0k.*(1-0k); 

DWkj=eta*Oj ■*0k; 

DTHETAk=eta*ones(1 ,R)»Dk; 

Dj»0j.*(1-0j).*(Dk*Wkj'); 

DW j i =eta*bpi n ' *D j ; 

DTHETAj=eta*ones(1 ,R)*D  j ; 

X  Addition  of  the  deltas  to  electronic  register 
Wkj=Wkj+DWkj;  THETAk=THETAk+OTHETAk; 

Wj i =Wj i+DWj i ;  THETA j=THETAj+DTHETAj ; 

X  Only  clipping  remains... 

Ukj=-  CUkj<-a).*a-'^(abs(Ukj)<=a).*Ukj+(Wkj>a).*a; 

THETAk=-(THETAk<-a).»a+(abs(THETAk)«a).*THETAk*(THETAk>a>.*a; 

Wji»-(Wji<-a).*a*(abs(Wji)<=a).*Uji*(Wji>a).*a; 

THETA j=- (THETA j  <-a) .*a+(abs(THETA j )<«a) .*THETAj*(THETAj>a) .*a; 
X  Forward  Pass 

NET j= [bpin,ones(R,1 )1*  rwj i ;THETAj] ; 

0j=1  ./(1+exp(-NETj)); 

NETk= [Oj ,ones(R, 1 )] *  tWk j ;THETAk] ; 

Ok=1  ./(1*exp(-METk)); 

detmse=  [detmse,  sijn(meanc(  (Ok  -  bpout } .  '2 ) }] ; 
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if  reni(n,DM)==0 

ldetinse(n/DN>1  )=inean(detnise); 
nof  f  <  n/DN'i-1  )s8ifn(  suni(  ( abs  (Ok  -  bpout  }>0 . 4  >  }  ) ; 
nof  f  2  ( n/DN>  1)  Esun(  suiK  ( abs  ( Ok  -  bpout )  >0 . 2  )  )  ) ; 
clear  detmse 

dispC convergence  check') 

X  based  on  mse  over  r,  not  individuals 
if  IdetmseCn/DN+DKsmsetol 

dispCless  detailed  mse  within  tolerance') 
break 
end 
end 
end 

clear  stop  Dk  Dj  X  temporary  variables 
meanc.m 

function  s=ffleanc(in) 

X  out^meancCin) 

X 

X  This  function  returns  a  colum  vector  with  the  mean  of 
X  each  row  of  in.  This  includes  the  case  where  in  has  but 
X  one  colunn  (i.e.,  it  won't  then  return  the  scalar). 

X  So  it's  not  strictly  the  transpose  of  mean. 

if  ncols<in)==1 
s=in; 
else 

S3(mean(in'))'; 

end 

ncols.m 

function  nc^ncolsCin) 

X  out^ncolsCin) 

X 

X  This  function  returns  the  nutber  of  columns 
X  in  the  matrix  passed  to  it. 

Inr,nc]=size(in); 

nrows.m 

function  nr=nrows(in) 

X  out=nrows(in) 

X 

X  This  function  returns  the  number  of  rows 
X  in  the  matrix  passed  to  it. 

[nr,nc] =size(in); 
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We  have  also  developed  a  diagnostic  program,  “cpcohtor.m”,  that  prints  out  a 
contour  map  of  the  solved  (or  unsolved)  network,  like  that  shown  in  Figure  7.  This 
program  is  valid  only  where  1  =  2,  and  generates  K  plots  (one  for  each  output  node  k). 

The  program  graphs  J  decision  lines,  delimiting  where  oj  crosses  0.5.  Given 
more  hidden  units  than  are  needed  for  a  problem,  back  propagation  usually  deems  some 
as  unnecessary  and  decreases  their  connections  (IFkj)  to  output  nodes.  On  screen, 
“cpcontor.m”  color-codes  the  decision  lines  accordingly:  those  associated  with  the 
strongest  connections  are  drawn  in  white;  medium  connections  yield  green  decision  lines, 
and  weak  connections  yield  blue  ones. 


The  output  Ok  is  represented  as  a  contour  map.  Red,  green,  and  blue  curves 
correspond  to  ouputs  at  0.3,  0.5,  and  0.7,  respectively. 

Incidentally,  running  “cpcontor.m”  on  an  untrained  net  initialized  with  and 
without  Gilbert's  method  dramatically  illustrates  what  this  method  does. 


cpcontor.m 
X  cpcontor  script 

X  This  program,  valid  only  for  two- input  nets,  generates  the 
X  contour  plot  of  each  output  unit,  hopeful Iv  siiperimposed 
X  with  both  the  hidden  unit  decision  lines  and  the  input, 
load  bpin 
load  bpout 

axisClO  1  0  1])  X  a  more  general  statement  would  be 
X  axis( [min(bpin(:,1}}  max(bpin(:,1))  fflin(bpin(:,2))  max(bpin(:,2})]) 
axis( 'square' } 

K=ncols(bpout);  R=nrows(bpin); 
for  k=1:IC 
if  IC>1 

act=['subplot(22'num2str(k}  '}'} 
eval(act) 

end 

X  Inputs 
for  r=1;R 

if  bpout(r,k)>0.5 

onplots[onplot;bpin(r, :)] ; 
else 

offplot=[offplot;bpin(r, : )] ; 

end 

end 

plot(onplot(:,1},onplot(:,2},'xg'),  hold  on 
plot{offplot(;,1),offplot(:,2), 'og') 
xlabeU'Oid)'),  ylabel('0i{2)') 

X  Dividing  I ines 
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bg8t>imx(abs(Uk  j  ( : ,  k) ' ) ) ; 
for 

x*0:0.1:1; 

Y=(x*Uii(1.j)-one8(1.11)*THEUj(i))./Wii(2.i); 
if  8b8{Ukj(j.k»<<1/3)«bg8t 
plot(x,y, 'b') 

elseif  ((abs(Wkj(j,k})»(1/3}*bgst)i(afa8(Wkj(j,k)><(2/3>*bg8t>> 
plot(x,y. 'g') 

elaeif  ab8(Ukj( j,k))>z(2/3)*bgst 
plot(x,y, 'w') 
end 
end 

X  Contour 

txme , yme] =meshdom( 0 : 0 . 02 : 1 , 0 : 0 . 02 : 1 ) ; 

for  colcos1:51  X  because  of  the  0:0.02:18  above 

NJ=txme(:.colco).yine(:,colco),ones<51,1)l*[Wji;THeTAjl; 

0J=1  ./(1*exp<-NJ)); 

NK= [OJ . ones(51 , 1)] • lUk j ;THETAk] ; 

0K(:,colco)=1  ./<1-»exp(-NK)); 

end 

contourfOK, [0.3, 0.5, 0.7] ,0:0.02:1.0:0.02:1) 
clear  onplot  offplot  x  y  X  Y  NJ  NK  OJ  OtC  colco  j  k  r  xme  yme 
hold  off 
end 

axis( 'normal ' ) 


100 


B.  Optical  Back  Propagation 


The  simulation  program  has  been  continually  evolving  since  the  start  of  the 
contract.  In  the  technology-independent  phase,  additive  noise  was  applied  to  the  weight 
after  each  change.  Technology-dependent  modeling  requires  making  a  commitment  to  a 
particular  architecture,  in  this  case  the  one  in  Figure  5.  Nevertheless,  the  basic  three- 
module  series  approach  has  been  retained.  At  the  end  the  weights  are  coded  back  to 
bipolar  form  for  use  by  “cpcontor.m.” 


Naturally,  “uframe.m”  must  ask  the  user  for  many  more  parameters  than 
“bframe.m”  did,  most  relating  to  the  architecture.  Many  of  the  variables  that  get 
defined  here  act  as  logical  variables  that  will  later  determine  whether  imperfection 
equations  are  performed  or  skipped.  (Only  crosstalk  is  always  computed,  even  if  set  to 
0.)  This  skipping  over  of  equations  not  called  for  significantly  shortens  simulation  time. 


uframe.m 
X  uframe  script 

%  Unlike  bframe.m,  this  is  specific  to  a  dedicated  subpixel 
%  architecture,  with  or  without  correction  for  Malus*  law. 
dispCThis  lets  you  run  uinit  and  uctrain  by  having') 
dispCyou  put  in  all  the  necessary  variables  beforehand.') 

J=input( 'Enter  J:  '); 
icseed=input( 'Enter  icseed:  '); 

smartinit=input( 'Enter  1  to  initialize  by  method  of  Gilbert:  '); 
if  smartinit==1 

steepnesssinputC '  Enter  the  initial  hidden  unit  hill  steepness:  '); 
spreads 1; 
else 

spreads inputC 'Enter  the  initial  condition  spread:  '); 
end 

etasinpuK 'Enter  eta:  '); 

N= input ('Enter  H:  '); 

0N=input( 'Enter  ON:  '); 

insetolsinput( 'Enter  the  mse  below  which  training  stops:  '); 
i njects input ( 'Enter  0  for  no  noise;  2  for  Gaussian  noise:  '); 
if  inject-*0 

t r seeds i nput ( '  Enter  trseed:  '); 

end 

if  injects=2 

varsinputC  Enter  the  variance:  '); 
end 

dispCYou  can  introduce  space-variant  gain  into  the') 
disp{ ' voltage -birefringence  transfer  function.') 
svga in=i nput { 'Enter  0  not  to,  1  to:  '); 
if  8vgain==1 
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svseed=input( '  Enter  the  seed:  '); 
swar=input( '  Enter  the  variance  of  this  gain:  •); 
end 

dispCYoo  can  then  introduce  a  finite  extinction  ratio') 
dispf'to  the  1-D  SLMs.') 
poorext=input( 'Enter  0  not  to,  1  to:  '); 
if  poorext==1 

extratio=input( '  Enter  the  extinction  ratio:  '); 
end 

dispCYou  can  incorporate  sine-related  nontinearities  that’) 
dispCcan  occur  in  architectures  without  electronic') 
disp( 'correction  for  the  law  of  Halus.') 

nonlin=input( 'Enter  0  not  to,  3  for  spatial  encoding  w/  Vbias:  '); 
if  nonlin==3 

tp2=input('  Enter  the  tp2  of  Vbias:  '); 
end 

dispCYou  can  introduce  PIN  detector  noise.') 

PINnoise=input( 'Enter  1  to  do  this:  '); 
if  PlNnoise==1 
if  inject==0 

trseed=inputC  Enter  trseed:  '); 
else 

disp<'  The  simulation  will  share  the  random  number') 

dispC  generator  with  that  of  weight  noise.’) 

trseed 
end 

SNR1smaxsinput( '  Enter  the  maxiiKjn  SNR  permitted  by  shot  noise:  ') 
SNRU=input< '  Enter  the  SNR  permitted  by  thermal  noise:  '); 
end 

dispCYou  can  introduce  CCO  detector  noise.') 

CC0noise= input ('Enter  1  to  do  this:  '); 
if  CC0noises=1 

if  ({PINnoise==0)S(inject==0)) 

trseed=inputC  Enter  trseed:  '); 
else 

dispC  The  simulation  will  share  the  random  nurber') 

dispC  generator  with  that  of  weight  noise  and/or') 

dispC  PIN  detector  noise.') 
trseed 

end 

SNR2smaxsinputC  Enter  the  maxinun  SNR  permitted  by  shot  noise:  '} 
SNR2t=inputC  Enter  the  SNR  permitted  by  thermal  noise:  '); 
end 

hr= input (’Enter  the  upper  (and  -lower)  bound:  '); 
if  nonlin==3 

dispC  The  bias-correcting  renormalizing  constant') 
dispC  does  not  affect  this  bound.') 
end 

b=input( 'Enter  the  Dk  normalizing  constant  (1  or  greater):  '); 
eps=input( 'Enter  the  crosstalk  fraction:  '); 
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uinit.m 


X  uinit  script 

X  dedicated  subpixel  architecture 
load  fapin 
load  bpout 
I-ncols(bpin) 

Ksncols(bpout) 

R=nrows( bpout) 
if  existCkeepUji ') 

Wji»keepWji;  THETAj=keepTHETAj; 

Ukj-keepUkj;  THETAk^keepTHETAk; 
else 

X  Initialize  the  weights  and  biases 

X  random  nuiters  about  zero  center 

randC  uni  form') 

randc ■ seed' , i cseed) 

rand(i,3);  X  to  exercise  generator 

Uji=(rand(l,j)-1  ./2)*spread; 

THETAi=(rand(1,J)-1  ./2)*spread; 

Ukj={rand(J,K)-1  ./2)*spread; 

THETAk=<rand(1,k)-1  ./2)*spread; 
if  smart init»1 
X  the  Uji 

X  numerator  is  predet.  magnitude  of  weights  vector 
X  which's  normal  to  dividing  hyperplane  (line  for  1=2) 
x=steepness./sqrt(siin(Uj  i  .''2)); 

U j i =(ones( 1 , 1 )*x) .*U j 1 ; 

X  the  THETAj 
THETAja-0.5«siin(Uji); 

X  the  Ukj 

Ukj=<1/j)*ones(J,K); 

end 

keepUjisWji,  keepTHETAj=THETAj 
keepUkj=Wkj,  keepTHETAk=THETAk 
end 

if  svgains=1 
randc  normal') 
rand( ' seed ' , svseed) 

X  Creation  of  the  svgain  matrices 
s  vUk  j  =addgauss  ( ones  ( J ,  2*K ) ,  s  wa  r ) 
svTHETAk=addgauss(ones( 1 ,2*K),8vvar) 
svW  j  i  =addgauss(ones(  1 , 2*  J  ) ,  swar ) 
svTHETA  j  =addgauss  ( ones(  1 , 2*  J ) ,  swar ) 
svOi=ones(R,1)*addgauss(ones(1,I),8vvar)  X  unipolar 
sv0j=ones(R,1)*addgauss(ones(1,J),8vvar)  X  "  " 

svoOk=ones(R, 1 )*addgauss(one8(1 ,2*K),svvar) 
svoO  j  >ones  ( R ,  1  )*addgauss  ( ones  ( 1 , 2*  J  ) ,  swar ) 
end 

if  poorext==1,  extbias=1/extratio;  end 
a=hr; 

if  nonlin*»3,  a=hr/(mal2a(1-tp2)-mal2a(tp2));  end 
X  normalizing  and  encoding  (splitting  and  clipping) 
esUkj=encode(Wkj/hr);  esTHETAk=encode(THETAk/hr); 
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esU j  i -encode(W j  i /hr) ;  esTHETA j=encoele(THETA j /hr) ; 

X  No  bias  yet  exists  to  remove 
if  nonlin*=0 

X  Inverse  Malus's  law 

eWkj=imal2a(esUkj);  eTHETAksiinal2a(esTHETAk); 
eUj* isimal2a(esUj i };  eTHETAj=imal2a(esTHETAj ); 
elseif  nonlin*=3 

eWk  jsnencocle(decode(esWkj  ) ,  tp2); 
eTHETAksnencodefdecodeCesTHETAk) , tp2) ; 
eU  j i snencode(decode( esU j i ) , tp2 ) ; 
eTHETA  j  snencode(decode(esTHETA j ) , tp2) ; 
end 

X  Add  just  a  pinch  of  noise 
if  inject==2 

eWk J saddgauss ( eWk j , va r ) ; 
eTHETAksaddgauss(eTHETAk,var); 
ewj i=addgauss(eUj i ,var); 
eTHETAj=addgauss(eTHETAj .var); 
end 

X  Multiply  by  space-variant  gain 
if  svgain==1 

eWkj=eWkj .*svUk] ;  eTHETAk=eTHETAk.*svTHETAk; 
eW] i=eWj i .*svWi i ;  eTHETAi=eTHETAj .*svTHETAj ; 
end 

X  Disallow  voltages<0 
eWkj=(eWki<0).*0  +  <eWkj>»0).*eWki; 
eTHETAk»<eTHETAk<0).*0  ♦  (eTHETAk>aO).*eTHETAk; 
eWii=<eWii<0).*0  +  <eWji>=0).*eWji; 
eTHETAJ=(eTHETAi<0).*0  +  (eTHETA j>»0).*eTHETAj; 

X  Finally,  the  Malus's  law  does  this:  (no  if  statement) 
oUkj=mal2a(eUkj );  oTHETAksmal2a(eTHETAk); 
oWjismal2a(eWji );  oTHETAismal2a(eTNETAj); 

X  And  there  may  be  that  poor  extinction  ratio 

if  poorext=3i 

oUkJs(1-extbias)*oUkj-'’extbia8; 
oTHETAk=( 1 ■ extbi as)*oTHETAk+extbi as; 
oWj i=(1 -extbias)*oUji>extbias; 
oTHETAj=(1 -extbi as)*oTHETAj+extbi as; 
end 

X  Preparation  for  mse  calculation 
X  the  zeroth- iteration  coluin  for  mse 
n=0; 

X  MSE  CALCULATION 
o0i=imal2a(bpin); 

if  inject==2,  oOi=addgauss(oOi,var);  end 
if  svgain==1,  o0i=o0i.*3v0i;  end 
o0i=(o0i<0).*0  +  (oOi>*0).*oOi; 
o0i=mal2a(o0i ); 

if  poorext==1,  oOis(1-extbias)*oOi4'extbias;  end 
X  Forward  Pass 

oNETjsvucross( [oOi ,ones(R,1)] ,eps)*[oUji;oTHETAJ]; 
oNET  jsvcrossf  oNET  j , eps ) ; 
if  PINnoiseo^l 

var1*((I*1)*oNETj/SNRl8max^2)^((H1)*ones(R,2*J)/SNR1t).‘2; 
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oNET  j  =vaddgaus( oNET  j , var1) ; 
end 

NET  j  =decode(oNET  j ) ; 

0j=1  ./<1+exp(-a*NETj)); 
o0j=iiiial2a(0j>; 

if  inject==2,  oOj=addgauss(oOj,var);  end 
if  svgain==1,  oOj=oOj.*svOj;  end 
oOj=(oOj<0).*0  (oOj>=0).*oOj; 

oOjzfflal2a(oOj); 

if  poorext~1,  oOj=(1-extbias)*oOj-»'extbias;  end 
oNETk=vucross(  [oOj  ,ones(R,  1 )]  ,eps)*  [oUkj .-oTHETAk]  ; 
oNETksvcross(oNETk,eps); 
if  PINnoise==1 

var1=({J+1)*oNETk/SNR1smax^2)>(<J+1)*ones(R,2*IC)/SMR1t).^2 
oNETk=vaddgaus(oNETk, var1); 
end 

NETk=decode(oNETK); 

0k=1  ./{1+exp(-a*NETk)) 

IdetmseC 1 }=sum(meanc( (Ok - bpout ) . ^2) ) ; 

X  literally,  less  detailed  mse 
X  this  first  one's  not  an  average 
nof  f  ( 1) =sifn(  sum(  (  abs  (  Ok  •  bpou  t )  >0 . 4  )  )  ) ; 
nof  f  2  ( 1 )  =sifn(  sun(  (  abs  (  Ok  •  bpout )  >0 . 2  )  )  ) ; 
clear  ans  x  X  temporary  variables 

uctrain.m 


X  uctrain  script 

X  dedicated  subpixel  architecture 

X  Accunulates  weight  change  over  epoch,  THEN  updates... 
if  ( ( i n j ect==2) | (P I Nnoi 8e==1 ) | (CCOnoi se==1 ) ) 
rand( 'normal ' } 

end 

if  ((n==0}&((inject-=0)|(PINnoise==1}|(CC0noises=1))) 
rand( 'seed' .trseed) 

end 

stop=n+N; 
while  n<stop 
n=n*-1 

X  Error  back  propagation 
Dk=(bpout-0k).*0k.*(1-0k); 
o0k3encode(0k*b);  X  always  compensated 
oOksimal2a(oOk}; 

if  inject^=2,  o0k>addgau88(o0k,var};  end 
if  svgain==1,  oOk=oDk.*svoOk;  end 
o0ks(oDk<0).*0  +  (o0k»0).*o0k; 
o0k=mal2a(o0k); 

if  poorext==1,  Q0k=(1-extbia8}*o0k'»extbiss;  end 
X  generation  of  Dj  requires  re-encoding  of  Ukj 
ooUkjs[oUkj(:,1;K);oUkj(:,K4'1:2*K)]; 
pPassi'vucrossloOkC : ,  1  :K),eps}*ooUkj '  ; 
pPasssvcrosslpPass, eps ) ; 
if  PINnoise"! 

var1-(K*pPas8/SNRl8inax^2)«(K*one8(R,2*J)/SNR1t).''2; 
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pPasssvaddgausCpPass, var1); 
end 

nPasssvucross(aOk( : ,K+1 :2*K),eps)*ooUkj ‘ ; 
nPass-vcross(nPass, eps ) ; 

nPas8=[nPass(:,J-t'1:2*J),nPass(:,1:J)];  X  electronic 
if  PINnoises=1 

var  1 =(  K*nPass/SMR  1  sinax^2  )♦  (K*ones<R ,  2*  J  )/SMR1 1).''2; 
nPass=vaddgaus ( nPass , va r 1 ) ; 
end 

0  j  =a*0 j . * ( 1  - 0 j ) . *decode( pPass+nPass ) ; 
oDj=encode(Dj);  X  again,  always  compensated 
oDj3imal2s(o0j); 

if  inject==2,  oOj=addgauss(oOj,var);  end 
if  svgain==1,  oOj=oOj .*svoOj;  end 
o0j=<o0j<0).*0  +  (o0j>=0).*o0j; 
o0j=inal2a(oDj); 

if  poorext==1,  oOjs(l-extbias)*oOj->'extbias;  end 
OWTk=eta*vucross( [oO j , ones(R, 1 )] , eps ) ' *vcross(oOk, eps); 
if  CCDnoise==1 

var2=(R*DWTk/SNR2sinax*2)+(R*ones(J+1.2*K)/SMR2t).^2; 
DUT  k=vaddgaus ( OUT k , va r2 ) ; 

end 

DUTjseta*vucross( [aOi,ones<R,1)],eps)'*vcro8s(oDi,eps); 
if  CC0noise==1 

var2=<R*0WT  j/SNR2smax''2)+(R*ones(  1+1 ,2*J)/SMR2t).''2; 
DUT  j  svaddgaus(OUT  j , var2 ) ; 
end 

X  Correction  for  b 
DWTk=OUTk/b;  OUTj=DWTj/b; 

X  Addition  of  the  deltas  to  electronic  storage  register 
esUkjsesUkj+DUTkd :  J, :  }/hr; 
esTHETAk=esTHETAk+OUTk< J+1 , ; )/hr; 
esUj i =esUj i+DUT  j< 1 : 1 , : )/hr; 
esTHETAj=esTHETAj+DWTj{I+1,:)/hr; 

X  Draining  off  the  excess 
esUk j =encode( decode ( esUk j )) ; 
esTHETAk=encode(decode(esTHETAk) ) ; 
esU j i sencode(decode(esU j i ) ) ; 
esTHETA jrencode(decode( esTHETA j ) ) ; 

X  If  it's  not  compensated,  it  at  least  adds  a  small 
X  bias  and  stretches  it. 
if  nonlinssO 

X  Inverse  Malus's  law 

eUkj=imal2a(esUkj};  eTHETAk=inial2a(esTHETAk); 
eUjisimal2a(esUji);  eTHETAj=imal2a(e8THETAj); 
elseif  nonlin==3 

eUk jsnencode(decode(esUk j ) , tp2); 
eTHETAksnencode(decode(esTHETAk) , tp2); 
eUjianencode(decode(esUji },tp2); 
eTHETA j>nencode(decode(esTHETA j ) , tp2}; 
end 

X  Add  just  a  pinch  of  noise 
if  injectM2 

eUkJ>addgaus3(eWkj ,var); 
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eTHETAIcsadclgauss(eTHETAI(,var); 
eU j i =addgauss ( eU j i , var ) ; 
eTHETAj«addgauss(eTHETAj , var); 
and 

X  Multiply  by  space-variant  gain 
if  8vgain==1 

eWkj=eUkj .*svUkj;  eTHETAk^eTHETAk.^svTHETAk; 
eW j i»eUj i .*8vWj i ;  eTHETA j=eTHETAj .•svTHETAj ; 

end 

X  Disallow  voltages<0 
eWkj=(eWkj<0).*0  +  (eWkj>«0).»eWkj; 
eTHETAk=(eTHETAk<0).*0  +  {eTHETAk>=0) .‘eTHETAk; 
eWji=(eWji<0).*0  ♦  (eWji>=0).*eWji; 
eTHETAj=(eTHETAj<0).*0  +  (eTHETAj>=0).*eTHETAj; 

X  Malus's  law  does  this:  (no  if  statement) 

0Ukj=inal2a(eUkj);  oTHETAksinal2a(eTHETAk); 

0Uji=mal2a(eUji);  oTHETAj=mal2a(eTHETAj); 
if  poorext«1 

oUk  j=<  1  -extbi  as)*oUk  j-^extbi  as; 
oTHETAk=(  1  -  extbi  as)*oTHETAk't'extbi  as; 
oWji=<1-extbias)*oWji-t’extbias: 
oTHETAj3(1-extbias)*oTHETAj'»'extbias; 
end 

X  Forward  Pass 
o0isimal2a(bpin); 

if  injectK2,  o0i»addgauss(o0i,v8r);  end 
if  svgain==1,  oOi«oOi.*svOi;  end 
o0i*(o0i<0).*0  ♦  (o0i>»0).*o0i; 
oOi=mal2a<oOi); 

if  poorext«1,  o0i-(1-extbias)*o0i>extbias;  end 
oNETjsvucrossl [oOi,ones(R,1}],eps}*(oWji;oTHETAj]; 
oNET j svcross ( oNET j , eps ) ; 
if  PINnoise~1 

var1=<<I+1)*oNETj7SNR1smax''2)*«I+1)*ones(R,2*J)/SHR1t).''2 
oNET jsvaddgausloNET j , var1 ); 
end 

HETj-decode(oNETj); 

0j=1  ./<1*exp(-a*METj)); 
o0jsinial2a(0j); 

if  injects-^,  o0j«addgau8s(o0j,var);  end 
if  svgain==1,  oOj=oOj,*svOj;  end 
o0j=(o0j<0).*0  ♦  (o0j>=0).*o0j; 
o0j=inal2a(o0j); 

if  poorext==1,  oOj=(1-extbias)*oOj+extbias;  end 
oNETksvucrossC [o0j,ones(R,1)] ,eps)*[oUkj;oTHETAk] ; 
oNETk=vcross(oNETk,eps}; 
if  PINnoise»1 

var1*({J4-1)*oNETk/SMRl8max*2)+<(J+1)*one8<R,2*K)/SNR1t).''2 
oNETk>vaddgaus(oNETk, varl ); 
end 

NETk>decode(oNETk); 

0k*1  ./(1+exp(-a*METk>); 

detmse* [detmse, 8un(meanc( (Ok- bpout ) . ^2} )] ; 

if  rem(n,DN)a«0 
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I  detiiise(  n/DN4^  1)  >iiiean(  detmse ) ; 
nof  f  ( n/DN-*- 1  )»siin(  sum(  ( abs  ( Ok  -  bpout )  >0 . 4  )  ) ) ; 
nof f 2( n/0M+ 1) ssunC  sum( ( abs (Ok - bpout  >>0.2))); 
clear  detmse 

disp( 'convergence  check'} 

X  based  on  mse  over  r,  not  individuals 
if  ldetmse(n/DN+1)<s:msetol 

dispCless  detailed  mse  within  tolerance') 
break 
end 
end 

end 

X  Decoding  for  compatibility  with  diagnostic  programs 
Uk j  =a*decode( oUk j ) ;  THETAk=a*decode( oTHETAk) ; 

W j i =a*decode(oU j i ) ;  THETA j=a*decode(oTHETAj ) ; 

X  Clearing  of  temporary  variables 
clear  stop  Dk  oOk  ooUkj  pPass  nPass  Oj  oOj 

addoauss.m 

function  ny«addgauss(in,var} 

X  out=addgauss(  in, spread) 

X 

X  This  function  adds  small  random  nuikiers 
X  with  a  normal  probability  distribution 
X  to  each  of  the  elements  in  in. 

X  NOTE:  It  is  ASSUMED  here  that  the  user  has 
X  already  switched  the  rand  mode  to  'normal'; 

X  the  default  is  uniform. 

[nr,nc]=size(in); 
ny= i  n*  va r* rand( nr , nc ) ; 

vaddgaus.m 

function  ny=vaddgaus{in,var) 

X  out°vaddgaus(in,vsr} 

X 

X  This  function  adds  small  random  nunbers 
X  with  a  normal  probability  distribution 
X  to  each  of  the  elements  in  in,  with  a  location- 
X  depenedent  variance. 

X  NOTE:  It  is  ASSUMED  here  that  the  user  has 
X  already  switched  the  rand  mode  to  'normal'; 

X  the  default  is  uniform. 


[nr,nc]=size(in); 

ny= i n+ va r . * rand( nr , nc ) ; 


encode.m 


function  eW=encode(W) 

X  This  function  splits  a  normalized  r  x  c  matrix  W  into 
X  an  r  X  2c  matrix  eU  wherein  one  r  x  c  submatrix  posseses 
X  all  the  (+)  parts  of  W,  the  other  <-),  one  of  which  is  in 
X  each  pair  is  0.  It  also  clips  values  to  1. 

X  For  negative  elements,  the  (-)  part  is  >  0. 

X  Convert  U  into  +  and  -  encoding 
W=t((W<0)*0  +  (U>0).*W),-{(W<0).*W  ♦  <W>0)*0)1; 

X  clipping 

eW=(W<1).*W  +  (W>=1); 
neneode.m 

function  eW=nencode(W,tp2) 

X  This  fmction  splits  a  normalized  r  x  c  matrix  W  into 
X  an  r  X  2c  matrix  eU  wherein  one  r  x  c  submatrix  posseses 
X  all  the  ('•’)  parts  of  W,  the  other  (-),  one  of  which  is  in 
X  each  pair  is  tp2.  But  the  proportion  isn't  one,  as 
X  in  mencode.m. 

X  For  negative  elements,  the  (-)  part  is  >  0. 

X  Convert  W  into  >  and  ■  encoding 

U=t<(W>=0).*<U*(1-2*tp2)+tp2)+(U<0)*tp2),<(W>=0)*tp2*(W<0).*{W*{2*tp2-1)*tp2))]; 
X  clipping 

eW=(W<»1-tp2).*W  ♦  <W>1-tp2).*(1-tp2); 
decode .m 


function  W=decode(eU) 

X  This  function  accepts  an  r  x  2c  matrix  eU,  itself 
X  containing  (+)  and  (-)  submatrices  ((-}>0), 

X  and  produces  W,  an  r  x  c  matrix  in  which  each  element  is 
X  the  difference  of  the  two  former  subelements. 

U=eU( : ,  1  :ncols(eW)/2)*eW( :  ,ncols(eW)/2'''1  :ncols(eU)); 

mal2a.m 


function  eU=mal2a(eW) 

X  This  function  accepts  a  normalized  r  x  2c  matrix  eW, 

X  itself  containing  (♦)  and  <-)  submatrices  ((-)>0), 

X  and  performs  the  Malus's  law  nonlinearity  on  every  element 

X  therein. 

eW=8in(pi/2*eU).''2; 

imal2a.m 

function  eU3imal2a(eW} 

X  This  function  accepts  a  normalized  matrix  eU 
X  and  performs  the  inverse  of  Halus's  law  on 
X  every  element  therein. 
eW=2/pi*asin(eW.*0.5); 


109 


cross. in 


function  x=cross(in,ep) 

X  Performs  crosstalk  as  though  in  is  encoded 

X  Ein1+,in1-,in2+...]  for  Hatlab  encoding 

X  rin1+,in2+, ...in1-,in2-,...] . 

nc=ncols<in); 

inp=in(1:nc/2); 

inn=in(nc/2+1 :nc); 

xp=(  1  -  2*ep>*  i  np*-ep*  ( i  nn+ to ,  i  nn(  1 :  nc/2  - 1 )]  ) ; 
xn={1-2*ep)*inn+ep*(inp«'[inp(:,2:nc/2),0] ); 
x=  [xp,xn]; 

vcross.m 

function  x=vcross(in,ep) 

X  Performs  crosstalk  as  though  in  is  encoded 

X  tin1+,in1-,in2+...]  for  Matlab  encoding 

X  [in1+,in2+,...in1-,in2-,...l. 

nc^ncolstin); 

nr=nrows(in); 

inp=in(;,1;nc/2); 

inn=i n( ; , nc/2+1 :nc); 

xp=(  1  ■  2*ep)*i  np<-ep*(  i  nrH  tzerostnr ,  1 ) ,  i  nn( : ,  1  ;nc/2- 1 )]  ) ; 
xn=<  1  -  2*ep)*  i  nrHep*(  i  np*  C  i  np( : ,  2  :nc/2) ,  zeros(nr ,  1 ))  ); 
x=  [xp,xn] ; 

ucross.m 

function  x=ucross<in,ep) 

X  crosstalk  for  unipolar  vectors 
nc=ncols(in}; 

x=(1  ■2*ep)*in+ep*<  [0,  in(1  :nc-1  )]•*■  [in(2:nc),0] ); 
vucross.m 

function  xsvucross(in,ep) 

X  crosstalk  for  unipolar  matrices 

nc=ncols(in}; 

nr=nroHS(in); 

x=(1-2*ep)*in*ep*( [zeros(nr,1),in(;,1:nc-1)]+tin(:,2:nc),zeros(nr,1)] ) 
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C.  Nestor  Learning  System 


Below  are  given  the  input  and  output  .mat  files  representing  the  2-D  skewed 
corners  problem.  The  output  is  coded  differently  than  for  back  propagation. 


nlin.mat  ntout.mat 


0 

0 

1 

0 

0.325 

0 

1 

0 

0.425 

0 

0 

1 

0.825 

0 

1 

0 

1 

0 

1 

0 

0 

0.225 

1 

0 

0.325 

0.225 

1 

0 

0.425 

0.225 

0 

1 

0.825 

0.225 

1 

0 

1 

0.225 

1 

0 

0 

0.525 

0 

1 

0.325 

0.525 

0 

1 

0.425 

0.525 

0 

1 

0.825 

0.525 

0 

1 

1 

0.525 

0 

1 

0 

0.725 

1 

0 

0.325 

0.725 

1 

0 

0.425 

0.725 

0 

1 

0.825 

0.725 

1 

0 

1 

0.725 

1 

0 

0 

1 

1 

0 

0.325 

1 

1 

0 

0.425 

1 

0 

1 

0.825 

1 

1 

0 

1 

1 

1 

0 

We  coded  the  Nestor  RCE  layer  using  the  same  three-module  approach  as  we  had 
in  straight  back  propagation.  Note  that  the  variables  regular  and  minimum  do  not  affect 
the  learning. 
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nlf rame.m 


X  nlframe  script 

dispCThis  lets  you  run  nlinit  and  nltrain  by  having') 
dispCyou  put  in  all  the  necessary  variables  beforehand.') 
regularsinput( 'Enter  the  regular  strength:  '); 
mininniiFinputl 'Enter  the  mininun  strength:  '); 

X  The  above  strengths  do  not  affect  training, 
thetami ns inputC 'Enter  the  minimun  neighborhood  size:  '); 
thetaRiax=input( 'Enter  the  maximiin  neighborhood  size:  ■); 
resolutionsinputC 'Enter  the  resolution  cell  size:  '); 
alpha=input( 'Enter  the  size  reduction  factor:  '); 

nlinit. m 

X  nlinit  script 
load  nlin 

load  nlout  X  I  think  this  should  have  K  at  least  two 
I=ncols(nlin) 

Js1  X  at  the  outset... 

K=ncols(nlout) 

R=nrows<nlin) 

X  Initialize  the  weights- --acc.  to  the  first  training  pair 
Uji=nlin(1,:)' 

THETA  jsthetainax 

Ukj=regular*nlout(1,:)  X  does  not  affect  training.  However, 
class(:,1)*nlout(1,;)'  X  does  affect  it  training 

distance=sqrt<sum(nlin' .'2)'*ones(1,J)  +  ones(R,1)*stjn(Wji.''2)  -  2*nlin*Wji) 
n=0 


nltrain. m 


X  nltrain  script 

X  We  assune  it  only  takes  a  few  iterations,  so  no  need 

X  to  specify  N. 

solvedsQ;  oldTHETAj=THETAj; 

whi le  solvedssQ 

n=n«-1  X  to  track  just  how  few  iterations  it  needs 
for  r=1:R 
r 

X  Reducing  conflicting  proto  neighborhood  size 
X  computing  1  x  J  conflict  logical  vector 
X  conflict(j)  can  s  1  even  if  jth  neighborhood  does 
X  not  overlap  current  feature  vector 
if  K==1 

conf I ict=(class-=nlout(r, ;)'*ones<1,J)); 
else 

conf I ict=any(class-snlout( r, : ) '*ones(1 , J)); 
end 

X  actual  size  reduction  where  needed 

THETAj=conf lict.*min( [THETAj;alpha*distance(r,:)-resolution] )  +  (-conflict). *THETAj 
X  except  it's  no  smaller  than  thetamin... 

THETAisinax([THETAj;thetafflin*ones(1,J)] ); 

X  nor  bigger  than  thetamax 
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THETAj>inin(  [THETAj;thetainax*on«s(1 ,  J)]  >; 

X  Classification  and  testing  of  unidentifiedness 
X  classify  and  display 
Oj=(distance(r, :)-THETAj<=0); 

X  dispd'The  first  'mmZstrdC)  ‘  rows  of  the']) 

X  dispCfollowing  are  the  prototype  classes.') 

X  dispCThe  last  row  is  the  prototype  output.') 

X  disp( 'There  are  J  columns.') 

X  Iclass;Oj] 

X  test-'-assunes  classes're  binary  w/  1  nonO  k-element 
nc I asses=sum( sunf ( c I ass . * ( ones ( K , 1 ) *0 j ) ) ' ) ' ) ; 
if  nclasses>1,  dispClt  is  confused.'),  end 
if  nclassess3l,  dispCIt  is  identified.'),  end 
if  nclasses==0 

dispCIt  is  unidentified.') 

X  Commitment  of  a  new  prototype  unit 
J=J+1; 

X  the  class  of  the  new  proto 
classs [class, nloutlr,:)']; 

X  the  neighborhood  size  of  the  new  proto 
TNETAj(J)smsx(min(alpha*di8tance(r,:)*resolution),thetamin); 

THETA j (J )«min(THETAj (J ) , thetamax); 

X  the  centroid  of  the  new  proto 
Wji=tWji,nlin<r,  :)']; 

newdistances=sqrt(sifn(nlin' .''2)'  ♦  ones(R,1)*sijn(Uji(:,J).''2)  •  2*nlin»Wji(;,J)); 
di stances [di stance, newdi stances] ; 
end 
end 

X  Is  it  solved? 

if  size(THETAj)sssize(oldTHETAj) 

if  THETA j==oldTHETAj,  solved=1;  end 
end 

oldTHETAjsTHETAj; 

end 

X  Updating  the  Wkj  connections  to  reflect  the  newly  added 
X  and/or  modified  prototype  units 
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D.  Modified  BEP 


In  Section  V.  C.  we  posed  a  new  algorithm,  one  which  enables  learning  in  a 
radial  basis  function  network  using  gradient  descent.  As  in  the  other  algorithms,  we 
adopted  a  three-module  “frame-init-train”  approach.  The  .mat  files  containing  the 
training  data  are  identical  in  form  to  those  used  for  the  back  propagation  simulations, 
differing  in  name  only. 

eframe.m 
X  eframe  script 

disp<‘This  lets  you  run  einit  and  ectrain  by  having') 
dispCyou  put  in  all  the  necessary  variables  beforehand.') 

J=input{ 'Enter  J:  '); 
icseed=input( 'Enter  icseed:  '); 

spreads input ('Enter  the  initial  condition  spread:  '); 
etasinputt 'Enter  eta:  '); 
laiibda=input( 'Enter  lambda:  '); 
xisinput( 'Enter  xi;  '); 

N=input('Enter  N:  '); 

0Ns{nput( 'Enter  ON:  '); 

msetolsinputC 'Enter  the  mse  below  which  training  stops:  '); 
nofftol=input< 'Enter  the  noff  at  which  training  stops:  '); 
a= input ('Enter  the  upper  (and  'lower)  bound:  '); 
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einit.m 


X  einit  script 
load  ein 
load  eout 
l^ncolsCein) 

K^ncolsleout) 

Rsnrows(eout) 
if  exist('keepCji') 

Cji=keepCji;  SIGMAji<kaepSIGMAji; 

Wkj=keapUkj;  THETAk=keapTHETAk; 
else 

X  Initialize  the  weights  and  biases 

X  properly  chosen  centers 

rand< 'uniform' ) 

rand( 'seed' , icseed) 

rand(1,3);  X  to  exercise  generator 

Cji=rand(l,J);  X  centers  confined  to  presuned  input  space 
SIGMAji2rand(l,j)*spread;  X  positive  mmbers 
Wkj=(rand(J.IC)-1  ./2)*spread; 

THETAk=(rand(1.lO-1  ./2)*spreod; 
keepCji=Cii;  keepSlCMAji*SIGMAji; 
keep«kj*Ukj;  keepTHETAk=THETAk; 
end 
n=0; 

X  MSE  CALCULATION 
X  Forward  Pass 

METi»ein.‘2*SIGNAii.*<-2i*one8(R,I)*(Cji,/SIGHAji).''2-2*ein*(Cji./SlGHAji.^2); 

Oj«exp(-NETj); 

NETk=  tOj .onesCR, 1 )]• tWkj ;THETAkl ; 

Ok=1  ./(1*exp(-NETk)); 

Idetmsef 1 )>8un(Meanc( (Ok-eout ) . '2) ); 

X  literally,  less  detailed  mse 
X  this  first  one's  not  an  average 
noff(1)«sun(8um((abs(0k-eout)>0.4))); 
nof  f  2  { 1  )=sifn(  s(jn(  ( abs(Ok  -  eout  )>0 . 2 ) )  )  ; 
clear  ans  x  X  temporary  variables 

ectrain.m 

X  ectrain  script 

X  Accunulstes  weight  change  over  epoch,  THEN  updates... 
stop»n+N; 
while  n<stop 
n=n*-1 

X  Error  back  propagation 
Dk*(eout-0k).*0k.*(1-0k); 

OWkj»eta»Oj'*Ok; 

DTHETAk»eta*one8( 1 , R )*0k; 

Dj«-Oj.*(Dk*Wkj'); 

OCji»-2*ltmbda*SlGMAji.*(-2).*(ein**Dj-Cji.*(ones{l,R>*Oj)); 

DSIGNAji«-2*xi*SIGMAji.''(-3).*(ein.*2'*0j-2*<ein’*Dj).*Cji^(Cji.*2).*{one8(I,R)*0j)); 
X  Addition  of  the  deltas  to  electronic  register 
Wkj«Wkj*OWkj ;  THETAk»THETAk*OTHETAk; 
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Cji=Cji+DCji;  SIGMAjisSIGHAji-i-DSIGNAji; 

X  Only  clipping  remains... 

Wlcj=-(Wkj<-a).*a+{abs(Ulcj)«a).*Wkj+(Wkj>8).*a; 

THETAk=-(THETAk<-a).*a+(8bs(THETAk)<=a).*THETAk+(THETAk>a).*a; 

Cji=-(Cj  i<-a).»a+<8bs(Cji  )<=a).*Cji+(Cji>a).*a; 
SIGMAji=-(SIGMAji<-a).*a+(abs(SIGMAji)<=a).*SIGMAji'»(SIGHAji>a).*8; 

X  Forward  Pass 

NETj=ein.''2*SlGMAji.*(-2)+ones(R,I)*<Cji./SIGHAji).*2-2*ein*(Cji./SIGMAji.*2) 

Oj=exp(-NETj); 

NETk=  [0  j ,  ones(R .  1)]  *  [Uk  j ;  THETAk]  ; 

Ok=1  ./(1*exp(-NETk)); 

detnise=  [detmse,  sijn(ineanc(  (Ok- eout )  .''2) )] ; 

if  rem(n,DN)==0 

I  detmseC  n/DN'*’ 1)sme8n(detmse) ; 
nof f ( n/DN+ 1) =sum( sum( ( abs ( Ok - eout ) >0 . A ) ) ) ; 
nof  f  2(  n/DN-*' 1)«sifn(  sun(  (  abs  (Ok  -  eout  )>0 . 2  )  )  ) ; 
clear  detmse 

displ 'convergence  check') 

X  based  on  mse  over  r,  not  individuals 
if  ldetinse(n/0N''’1  )<zinsetol 

dispCless  detailed  mse  within  tolerance') 
break 

elseif  noff (n/DN+1 )<=nofftol 

disp( 'nifitier  misclassified  within  tolerance') 
break 
end 
end 
end 

X  clear  stop  Ok  Oj  X  temporary  variables 
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